Stanford Encyclopedia of Philosophy

Causal Models

First published Tue Aug 7, 2018

Causal models are mathematical models representing causal relationships within an individual system or population. They facilitate inferences about causal relationships from statistical data. They can teach us a good deal about the epistemology of causation, and about the relationship between causation and probability. They have also been applied to topics of interest to philosophers, such as the logic of counterfactuals, decision theory, and the analysis of actual causation.

1. Introduction

Causal modeling is an interdisciplinary field that has its origin in the statistical revolution of the 1920s, especially in the work of the American biologist and statistician Sewall Wright (1921). Important contributions have come from computer science, econometrics, epidemiology, philosophy, statistics, and other disciplines. Given the importance of causation to many areas of philosophy, there has been growing philosophical interest in the use of mathematical causal models. Two major works—Spirtes, Glymour, and Scheines 2000 (abbreviated SGS), and Pearl 2009—have been particularly influential.

A causal model makes predictions about the behavior of a system. In particular, a causal model entails the truth value, or the probability, of counterfactual claims about the system; it predicts the effects of interventions; and it entails the probabilistic dependence or independence of variables included in the model. Causal models also facilitate the inverse of these inferences: if we have observed probabilistic correlations among variables, or the outcomes of experimental interventions, we can determine which causal models are consistent with these observations. The discussion will focus on what it is possible to do “in principle”. For example, we will consider the extent to which we can infer the correct causal structure of a system, given perfect information about the probability distribution over the variables in the system. This ignores the very real problem of inferring the true probabilities from finite sample data. In addition, the entry will discuss the application of causal models to the logic of counterfactuals, the analysis of causation, and decision theory.

2. Basic Tools

This section introduces some of the basic formal tools used in causal modeling, as well as terminology and notational conventions.

2.1 Variables, Logic, and Language

Variables are the basic building blocks of causal models. They will be represented by italicized upper case letters. A variable is a function that can take a variety of values. The values of a variable can represent the occurrence or non-occurrence of an event, a range of incompatible events, a property of an individual or of a population of individuals, or a quantitative value. For instance, we might want to model a situation in which Suzy throws a stone and a window breaks, and have variables S and W such that:

  • \(S = 1\) represents Suzy throwing a rock; \(S = 0\) represents her not throwing.
  • \(W = 1\) represents the window breaking; \(W = 0\) represents the window remaining intact.

If we are modeling the influence of education on income in the United States, we might use variables E and I such that:

  • \(E(i) = 0\) if individual i has no high school education; 1 if an individual has completed high school; 2 if an individual has had some college education; 3 if an individual has a bachelor’s degree; 4 if an individual has a master’s degree; and 5 if an individual has a doctorate (including the highest degrees in law and medicine).
  • \(I(i) = x\) if individual i has a pre-tax income of $x per year.

The set of possible values of a variable is the range of that variable. We will usually assume that variables have finitely many possible values, as this will keep the mathematics and the exposition simpler. However, causal models can also feature continuous variables, and in some cases this makes an important difference.

A world is a complete specification of a causal model; the details will depend upon the type of model. For now, we note that a world will include, inter alia, an assignment of values to all of the variables in the model. If the variables represent the properties of individuals in a population, a world will include an assignment of values to every variable, for every individual in the population. A variable can then be understood as a function whose domain is a set of worlds, or a set of worlds and individuals.

If X is a variable in a causal model, and x is a particular value in the range of X, then \(X = x\) is an atomic proposition. The logical operations of negation (“not”), conjunction (“and”), disjunction (“or”), the material conditional (“if…then…”), and the biconditional (“if and only if”) are represented by “\({\sim}\)”, “&”, “\(\lor\)”, “\(\supset\)”, and “\(\equiv\)” respectively. Any proposition built out of atomic propositions and these logical operators will be called a Boolean proposition. Note that when the variables are defined over individuals in a population, reference to an individual is not included in a proposition; rather, the proposition as a whole is true or false of the various individuals in the population.

We will use basic notation from set theory. Sets will appear in boldface.

  • \(\mathbf{\varnothing}\) is the empty set (the set that has no members or elements).
  • \(x \in \bX\) means that x is a member or element of the set \(\bX\).
  • \(\bX \subseteq \bY\) means that \(\bX\) is a subset of \(\bY\); i.e., every member of \(\bX\) is also a member of \(\bY\). Note that both \(\mathbf{\varnothing}\) and \(\bY\) are subsets of \(\bY\).
  • \(\bX \setminus \bY\) is the set that results from removing the members of \(\bY\) from \(\bX\).
  • \(\forall\) and \(\exists\) are the universal and existential quantifiers, respectively.

If \(\bS = \{x_1, \ldots, x_n\}\) is a set of values in the range of X, then \(X \in \bS\) is used as shorthand for the disjunction of propositions of the form \(X = x_i\), for \(i = 1, \ldots, n\). Boldface represents ordered sets or vectors. If \(\bX = \{X_1, \ldots, X_n\}\) is a vector of variables, and \(\bx = \{x_1, \ldots, x_n\}\) is a vector of values, with each value \(x_i\) in the range of the corresponding variable \(X_i\), then \(\bX = \bx\) is the conjunction of propositions of the form \(X_i = x_i\).

2.2 Probability

In section 4, we will consider causal models that include probability. Probability is a function, \(\Pr\), that assigns values between zero and one, inclusive. The domain of a probability function is a set of propositions that will include all of the Boolean propositions described above, but perhaps others as well.

Some standard properties of probability are the following:

  • If A is a contradiction, then \(\Pr(A) = 0\).
  • If A is a tautology, then \(\Pr(A) = 1\).
  • If \(\Pr(A \amp B) = 0\), then \(\Pr(A \lor B) = \Pr(A) + \Pr(B)\).
  • \(\Pr({\sim}A) = 1 - \Pr(A)\).

Some further definitions:

  • The conditional probability of A given B, written \(\Pr(A \mid B)\), is standardly defined as follows:

    \[\Pr(A \mid B) = \frac{\Pr(A \amp B)}{\Pr(B)}. \]

    We will ignore problems that might arise when \(\Pr(B) = 0\).

  • A and B are probabilistically independent (with respect to \(\Pr\)) just in case \(\Pr(A \amp B) = \Pr(A) \times \Pr(B)\). A and B are probabilistically dependent or correlated otherwise. If \(\Pr(B) \gt 0\), then A and B will be independent just in case \(\Pr(A \mid B) = \Pr(A)\).
  • Variables X and Y are probabilistically independent just in case all propositions of the form \(X = x\) and \(Y = y\) are probabilistically independent.
  • A and B are probabilistically independent conditional on C just in case \(\Pr(A \amp B \mid C) = \Pr(A \mid C) \times \Pr(B \mid C)\). If \(\Pr(B \amp C) \gt 0\), this is equivalent to \(\Pr(A \mid B \amp C) = \Pr(A \mid C)\). Following the terminology of Reichenbach (1956), we will also say that C screens off B from A when these equalities hold. Conditional independence among variables is defined analogously.
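
The screening-off condition can be checked numerically. The following sketch (not from the entry itself) uses a small hypothetical joint distribution in which a variable C is a common cause of A and B; all of the numbers are invented for the example.

```python
# Illustrative only: a hypothetical joint distribution over binary
# variables A, B, C in which C is a common cause of A and B. The numbers
# (Pr(C=1) = 0.5, Pr(A=1|C=1) = 0.8, Pr(A=1|C=0) = 0.2) are invented.
from itertools import product

def pr_given_c(value, c):
    # Pr(A = value | C = c); B has the same conditional distribution.
    p1 = 0.8 if c == 1 else 0.2
    return p1 if value == 1 else 1 - p1

joint = {(a, b, c): 0.5 * pr_given_c(a, c) * pr_given_c(b, c)
         for a, b, c in product([0, 1], repeat=3)}

def pr(**event):
    # Marginal probability of a partial assignment to a, b, c.
    return sum(p for (a, b, c), p in joint.items()
               if all(dict(a=a, b=b, c=c)[k] == v for k, v in event.items()))

# Unconditionally, A and B are correlated:
print(pr(a=1, b=1), pr(a=1) * pr(b=1))  # approximately 0.34 vs 0.25

# But C screens off B from A: Pr(A & B | C) = Pr(A | C) * Pr(B | C).
for c in (0, 1):
    lhs = pr(a=1, b=1, c=c) / pr(c=c)
    rhs = (pr(a=1, c=c) / pr(c=c)) * (pr(b=1, c=c) / pr(c=c))
    assert abs(lhs - rhs) < 1e-12
```

Here A and B are correlated only because of their common cause; conditioning on C removes the correlation, exactly as Reichenbach’s screening-off condition requires.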

As a convenient shorthand, a probabilistic statement that contains only a variable or set of variables, but no values, will be understood as having a universal quantification over all possible values of the variable(s). Thus if \(\bX = \{X_1, \ldots, X_m\}\) and \(\bY = \{Y_1, \ldots, Y_n\}\), we may write

\[\Pr(\bX \mid \bY) = \Pr(\bX)\]

as shorthand for

\[\begin{aligned} \forall x_1 \ldots \forall x_m\forall y_1 \ldots\forall y_n [\Pr(X_1\! =\! x_1 ,\ldots ,X_m\! =\! x_m \mid Y_1\! =\!y_1 ,\ldots ,Y_n\! =\! y_n ) \\ = \Pr(X_1\! =\! x_1 ,\ldots ,X_m\! =\! x_m ) ]\end{aligned}\]

where the domain of quantification for each variable will be the range of the relevant variable.

We will not presuppose any particular interpretation of probability (see the entry on interpretations of probability), but we will assume that frequencies in appropriately chosen samples provide evidence about the underlying probabilities. For instance, suppose there is a causal model including the variables E and I described above, with \(\Pr(E = 3) = .25\). Then we expect that if we survey a large, randomly chosen sample of American adults, we will find that approximately a quarter of them have a Bachelor’s degree, but no higher degree. If the survey produces a sample frequency that substantially differs from this, we have evidence that the model is inaccurate.

2.3 Graphs

If \(\bV\) is the set of variables included in a causal model, one way to represent the causal relationships among the variables in \(\bV\) is by a graph. Although we will introduce and use graphs in section 3, they will play a more prominent role in section 4. We will discuss two types of graphs. The first is the directed acyclic graph (DAG). A directed graph \(\bG\) on variable set \(\bV\) is a set of ordered pairs of variables in \(\bV\). We represent this visually by drawing an arrow from X to Y just in case \(\langle X, Y\rangle\) is in \(\bG\). Figure 1 shows a directed graph on variable set \(\bV = \{S, T, W, X, Y, Z\}\).

a diagram where S has an arrow pointing north to T; T has an arrow pointing northwest to X and northeast to Y; Y has an arrow pointing northeast to Z; W has an arrow pointing north to Z and northwest to Y

Figure 1

A path in a directed graph is a non-repeating sequence of arrows that have endpoints in common. For example, in Figure 1 there is a path from X to Z, which we can write as \(X \leftarrow T \rightarrow Y \rightarrow Z\). A directed path is a path in which all the arrows point in the same direction; for example, there is a directed path \(S \rightarrow T \rightarrow Y \rightarrow Z\). A directed graph is acyclic, and hence a DAG, if there is no directed path from a variable to itself; such a directed path is called a cycle. The graph in Figure 1 contains no cycles, and hence is a DAG.
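
These definitions can be sketched in code. The encoding below (our own illustration, not from the entry) represents the Figure 1 graph as a set of ordered pairs, following the definition of a directed graph given above, and tests for directed paths and acyclicity.

```python
# Illustrative encoding of the Figure 1 graph as a set of ordered pairs.
G = {("S", "T"), ("T", "X"), ("T", "Y"), ("Y", "Z"), ("W", "Z"), ("W", "Y")}

def has_directed_path(graph, start, end):
    # Depth-first search that only follows arrows in their direction.
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        for x, y in graph:
            if x == node and y not in seen:
                if y == end:
                    return True
                seen.add(y)
                stack.append(y)
    return False

def is_dag(graph):
    # Acyclic: no directed path from any variable back to itself.
    variables = {v for edge in graph for v in edge}
    return not any(has_directed_path(graph, v, v) for v in variables)

print(has_directed_path(G, "S", "Z"))  # True: S -> T -> Y -> Z
print(is_dag(G))                       # True: Figure 1 is a DAG
```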

The relationships in the graph are often described using the language of genealogy. The variable X is a parent of Y just in case there is an arrow from X to Y. \(\bPA(Y)\) will denote the set of all parents of Y. In Figure 1, \(\bPA(Y) = \{T, W\}\). X is an ancestor of Y (and Y is a descendant of \(X\)) just in case there is a directed path from X to Y. However, it will be convenient to deviate slightly from the genealogical analogy and define “descendant” so that every variable is a descendant of itself. \(\bDE(X)\) denotes the set of all descendants of X. In Figure 1, \(\bDE(T) = \{T, X, Y, Z\}\).

An arrow from Y to Z in a DAG represents that Y is a direct cause of Z. Roughly, this means that the value of Y makes some causal difference for the value of Z, and that Y influences Z through some process that is not mediated by any other variable in \(\bV\). Directness is relative to a variable set: Y may be a direct cause of Z relative to variable set \(\bV\), but not relative to variable set \(\bV'\) that includes some additional variable(s) that mediate the influence of Y on Z. As we develop our account of graphical causal models in more detail, we will be able to say more precisely what it means for one variable to be a direct cause of another. While we will not define “cause”, causal models presuppose a broadly difference-making notion of causation, rather than a causal process notion (Salmon 1984, Dowe 2000) or a mechanistic notion (Machamer, Darden, & Craver 2000; Glennan 2017). We will call the system of direct causal relations represented in a DAG such as Figure 1 the causal structure on the variable set \(\bV\).

A second type of graph that we will consider is an acyclic directed mixed graph (ADMG). An ADMG may contain double-headed arrows, as well as single-headed arrows. A double-headed arrow represents a latent common cause. A latent common cause of variables X and Y is a common cause that is not included in the variable set \(\bV\). For example, suppose that X and Y share a common cause L (Figure 2(a)). An ADMG on the variable set \(\bV = \{X, Y\}\) will look like Figure 2(b).

diagram with L having an arrow pointing northwest to X and northeast to Y

(a)

X is to the right of Y and a double-headed curved arrow connects the two

(b)

Figure 2

We can be a bit more precise. We only need to represent missing common causes in this way when they are closest common causes. That is, a graph on \(\bV\) should contain a double-headed arrow between X and Y when there is a variable L that is omitted from \(\bV\), such that if L were added to \(\bV\) it would be a direct cause of X and Y.

In an ADMG, we expand the definition of a path to include double-headed arrows. Thus, \(X \leftrightarrow Y\) is a path in the ADMG shown in Figure 2(b). Directed path retains the same meaning, and a directed path cannot contain double-headed arrows.

We will adopt the convention that both DAGs and ADMGs represent the presence and absence of both direct causal relationships and latent common causes. For example, the DAG in Figure 1 represents that W is a direct cause of Y, that X is not a direct cause of Y, and that there are no latent common causes. The absence of double-headed arrows from Figure 1 does not show merely that we have chosen not to include latent common causes in our representation; it shows that there are no latent common causes.

3. Deterministic Structural Equation Models

In this section, we introduce deterministic structural equation models (SEMs), postponing discussion of probability until Section 4. We will consider two applications of deterministic SEMs: the logic of counterfactuals, and the analysis of actual causation.

3.1 Introduction to SEMs

A SEM characterizes a causal system with a set of variables, and a set of equations describing how each variable depends upon its immediate causal predecessors. Consider a gas grill, used to cook meat. We can describe the operations of the grill using the following variables:

  • Gas connected (1 if yes, 0 if no)
  • Gas knob (0 for off, 1 for low, 2 for medium, 3 for high)
  • Gas level (0 for off, 1 for low, 2 for medium, 3 for high)
  • Igniter (1 if pressed, 0 if not)
  • Flame (0 for off, 1 for low, 2 for medium, 3 for high)
  • Meat on (0 for no, 1 for yes)
  • Meat cooked (0 for raw, 1 for rare, 2 for medium, 3 for well done)

Thus, for example, Gas knob = 1 means that the gas knob is set to low; Igniter = 1 means that the igniter is pressed, and so on. Then the equations might be:

  • Gas level = Gas connected \(\times\) Gas knob
  • Flame = Gas level \(\times\) Igniter
  • Meat cooked = Flame \(\times\) Meat on

The last equation, for example, tells us that if the meat is not put on the grill, it will remain raw (Meat cooked = 0). If the meat is put on the grill, then it will get cooked according to the level of the flame: if the flame is low (Flame = 1), the meat will be rare (Meat cooked = 1), and so on.

By convention, each equation has one effect variable on the left hand side, and one or more cause variables on the right hand side. We also exclude from each equation any variable that makes no difference to the value of the effect variable. For example, the equation for Gas level could be written as Gas level = (Gas connected \(\times\) Gas knob) \(+\) (0 \(\times\) Meat cooked); but since the value of Meat cooked makes no difference to the value of Gas level in this equation, we omit the variable Meat cooked. A SEM is acyclic if the variables can be ordered so that variables never appear on the left hand side of an equation after they have appeared on the right. Our example is acyclic, as shown by the ordering of variables given above. In what follows, we will assume that SEMs are acyclic, unless stated otherwise.

We can represent this system of equations as a DAG (Figure 3):

diagram: 'Gas connected' has an arrow pointing northeast to 'Gas level'; 'Gas knob' has an arrow pointing northwest to the same 'Gas level'; 'Gas level' has an arrow pointing northeast to 'Flame'; 'Igniter' has an arrow pointing northwest to the same 'Flame'; 'Flame' has an arrow pointing northeast to 'Meat cooked'; 'Meat on' has an arrow pointing northwest to the same 'Meat cooked'

Figure 3

An arrow is drawn from variable X to variable Y just in case X figures as an argument in the equation for Y. The graph contains strictly less information than the set of equations; in particular, the DAG gives us qualitative information about which variables depend upon which others, but it does not tell us anything about the functional form of the dependence.

The variables in a model will typically depend upon further variables that are not explicitly included in the model. For instance, the level of the flame will also depend upon the presence of oxygen. Variables that are not explicitly represented in the model are assumed to be fixed at values that make the equations appropriate. For example, in our model of the gas grill, oxygen is assumed to be present in sufficient quantity to sustain a flame ranging in intensity from low to high.

In our example, the variables Gas level, Flame, and Meat cooked are endogenous, meaning that their values are determined by other variables in the model. Gas connected, Gas knob, Igniter, and Meat on are exogenous, meaning that their values are determined outside of the system. In all of the models that we will consider in section 3, the values of the exogenous variables are given or otherwise known.

Following Halpern (2016), we will call an assignment of values to the exogenous variables a context. In an acyclic SEM, a context uniquely determines the values of all the variables in the model. An acyclic SEM together with a context is a world (what Halpern 2016 calls a “causal setting”). For instance, if we add the setting

  • Gas connected = 1
  • Gas knob = 3
  • Igniter = 1
  • Meat on = 1

to our three equations above, we get a world in which Gas level = 3, Flame = 3, and Meat cooked = 3.
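
The way a context determines a world in an acyclic SEM can be sketched as follows; the dictionary-based representation and lower-case variable names are illustrative choices of ours, not part of the entry.

```python
# Illustrative sketch of the gas grill SEM with the context given above.
context = {"gas_connected": 1, "gas_knob": 3, "igniter": 1, "meat_on": 1}

# One equation per endogenous variable, listed in an acyclic order.
equations = {
    "gas_level":   lambda v: v["gas_connected"] * v["gas_knob"],
    "flame":       lambda v: v["gas_level"] * v["igniter"],
    "meat_cooked": lambda v: v["flame"] * v["meat_on"],
}

def solve(equations, context):
    # In an acyclic SEM, a context uniquely determines all values.
    values = dict(context)
    for var, eq in equations.items():  # dict order respects acyclicity
        values[var] = eq(values)
    return values

world = solve(equations, context)
print(world["gas_level"], world["flame"], world["meat_cooked"])  # 3 3 3
```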

The distinctively causal or “structural” content of a SEM derives from the way in which interventions are represented. To intervene on a variable is to set the value of that variable by a process that overrides the usual causal structure, without interfering with the causal processes governing the other variables. More precisely, an intervention on a variable X overrides the normal equation for X, while leaving the other equations unchanged. For example, to intervene on the variable Flame in our example would be to set the flame to a specified level regardless of whether the igniter is pressed or what the gas level is. (Perhaps, for example, one could pour kerosene into the grill and light it with a match.) Woodward (2003) proposes that we think of an intervention as a causal process that operates independently of the other variables in the model. Randomized controlled trials aim to intervene in this sense. For example, a randomized controlled trial to test the efficacy of a drug for hypertension aims to determine whether each subject takes the drug (rather than a placebo) by a random process such as a coin flip. Factors such as education and health insurance that normally influence whether someone takes the drug no longer play this role for subjects in the trial population. Alternately, we could follow the approach of Lewis (1979) and think of an intervention as setting the value of a variable by a minor “miracle”.

To represent an intervention on a variable, we replace the equation for that variable with a new equation stating the value to which the variable is set. For example, if we intervene to set the level of flame at low, we would represent this by replacing the equation Flame = Gas level \(\times\) Igniter with Flame = 1. This creates a new causal structure in which Flame is an exogenous variable; graphically, we can think of the intervention as “breaking the arrows” pointing into Flame. The new system of equations can then be solved to discover what values the other variables would take as a result of the intervention. In the world described above, our intervention would produce the following set of equations:

  • Gas connected = 1
  • Gas knob = 3
  • Igniter = 1
  • Meat on = 1
  • Gas level = Gas connected × Gas knob
  • Flame = Gas level × Igniter [struck through]
  • Flame = 1
  • Meat cooked = Flame × Meat on

We have struck through the original equation for Flame to show that it is no longer operative. The result is a new world with a modified causal structure, with Gas level = 3, Flame = 1, and Meat cooked = 1. Since the equation connecting Flame to its causes is removed, any changes introduced by setting Flame to 1 will only propagate forward through the model to the descendants of Flame. The intervention changes the values of Flame and Meat cooked, but it does not affect the values of the other variables. We can represent interventions on multiple variables in the same way, replacing the equations for all of the variables intervened on.
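
Equation replacement can be sketched in code in the same illustrative style (self-contained, with representation choices of our own): intervening on Flame replaces its equation with a constant, and re-solving yields the modified world described above.

```python
# Illustrative sketch of intervening on Flame by equation replacement.
context = {"gas_connected": 1, "gas_knob": 3, "igniter": 1, "meat_on": 1}
equations = {
    "gas_level":   lambda v: v["gas_connected"] * v["gas_knob"],
    "flame":       lambda v: v["gas_level"] * v["igniter"],
    "meat_cooked": lambda v: v["flame"] * v["meat_on"],
}

def solve(equations, context):
    values = dict(context)
    for var, eq in equations.items():
        values[var] = eq(values)
    return values

def intervene(equations, **settings):
    # Replace the equation of each intervened variable with a constant,
    # leaving all other equations unchanged ("breaking the arrows").
    new_eqs = dict(equations)
    for var, val in settings.items():
        new_eqs[var] = lambda v, val=val: val
    return new_eqs

post = solve(intervene(equations, flame=1), context)
# Only descendants of Flame change: Gas level stays 3.
print(post["gas_level"], post["flame"], post["meat_cooked"])  # 3 1 1
```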

Interventions help to give content to the arrows in the corresponding DAG. If variable \(X_i\) is a parent of \(X_j\), this means that there exists some setting for all of the other variables in the model, such that when we set those variables to those values by means of an intervention, intervening on \(X_i\) can still make a difference for the value of \(X_j\). For example, in our original model, Gas level is a parent of Flame. If we set the value of Igniter to 1 by means of an intervention, and set Gas knob, Gas connected, Meat on, and Meat cooked to any values at all, then intervening on the value of Gas level makes a difference for the value of Flame. Setting the value of Gas level to 1 would yield a value of 1 for Flame; setting Gas level to 2 yields a Flame of 2; and so on.

3.2 Structural Counterfactuals

A counterfactual is a proposition in the form of a subjunctive conditional. The antecedent posits some circumstance, typically one that is contrary to fact. For example, in our gas grill world, the flame was high, and the meat was well done. We might reason: “if the flame had been set to low, the meat would have been rare”. The antecedent posits a hypothetical state of affairs, and the consequent describes what would have happened in that hypothetical situation.

Deterministic SEMs naturally give rise to a logic of counterfactuals. These counterfactuals are called structural counterfactuals or interventionist counterfactuals. Structural counterfactuals are similar in some ways to what Lewis (1979) calls non-backtracking counterfactuals. In a non-backtracking counterfactual, one does not reason backwards from a counterfactual supposition to draw conclusions about the causes of the hypothetical situation. For instance, one would not reason “If the meat had been cooked rare, then the flame would have been set to low”. Lewis (1979) proposes that we think of the antecedent of a counterfactual as coming about through a minor “miracle”. The formalism for representing interventions described in the previous section prevents backtracking from effects to causes.

The logic of structural counterfactuals has been developed by Galles and Pearl (1998), Halpern (2000), Briggs (2012), and Zhang (2013a). This section will focus on Briggs’ formulation; it has the richest language, but unlike the other approaches it cannot be applied to causal models with cycles. Despite a shared concern with non-backtracking counterfactuals, Briggs’ logic differs in a number of ways from the more familiar logic of counterfactuals developed by Stalnaker (1968) and Lewis (1973b).

We interpret the counterfactual conditional \(A \boxright B\) as saying that B would be true, if A were made true by an intervention. The language of structural counterfactuals does not allow the connective “\(\boxright\)” to appear in the antecedents of counterfactuals. More precisely, we define well-formed formulas (wffs) for the language inductively:

  • Boolean propositions are wffs
  • If A is a Boolean proposition, and B is a wff, then \(A \boxright B\) is a wff

This means, for example, that \(A \boxright (B \boxright (C \boxright D))\) is a wff, but \(A \boxright ((B \boxright C) \boxright D)\) is not, since the embedded counterfactual in the consequent does not have a Boolean proposition as an antecedent.

Consider the world of the gas grill, described in the previous section:

  • Gas connected = 1
  • Gas knob = 3
  • Igniter = 1
  • Meat on = 1
  • Gas level = Gas connected \(\times\) Gas knob
  • Flame = Gas level \(\times\) Igniter
  • Meat cooked = Flame \(\times\) Meat on

To evaluate the counterfactual \({\textit{Flame} = 1} \boxright {\textit{Meat cooked} = 1}\) (if the flame had been set to low, the meat would have been cooked rare), we replace the equation for Flame with the assignment Flame = 1. We can then compute that Meat cooked = 1; the counterfactual is true. If the antecedent is a conjunction of atomic propositions, such as Flame = 1 and Igniter = 0, we replace all of the relevant equations. A special case occurs when the antecedent conjoins atomic propositions that assign different values to the same variable, such as Flame = 1 and Flame = 2. In this case, the antecedent is a contradiction, and the counterfactual is considered trivially true.

If the antecedent is a disjunction of atomic propositions, or a disjunction of conjunctions of atomic propositions, then the consequent must be true when every possible intervention or set of interventions described by the antecedent is performed. Consider, for instance,

\[\begin{aligned}(({\textit{Flame}= 1} \amp {\textit{Gas level}= 0}) \lor ({\textit{Flame}= 2} \amp {\textit{Meat on}= 0}))\\{} \boxright ({\textit{Meat cooked}= 1} \lor {\textit{Meat cooked}= 2}).\end{aligned}\]

If we perform the first intervention, we compute that Meat cooked = 1, so the consequent is true. However, if we perform the second intervention, we compute that Meat cooked = 0. Hence the counterfactual comes out false. Some negations are treated as disjunctions for this purpose. For example, \({\sim}(\textit{Flame} = 1)\) would be treated in the same way as the disjunction

\[{\textit{Flame} = 0} \lor {\textit{Flame} = 2} \lor {\textit{Flame} = 3}.\]

If the consequent contains a counterfactual, we iterate the procedure. Consider the counterfactual:

\[\begin{align}{\textit{Flame} = 1}&\ \boxright\ ({\textit{Gas level} = 0}\ \boxright \\& ({\textit{Flame} = 2} \boxright {\textit{Meat cooked} = 1})).\end{align}\]

To evaluate this counterfactual, we first change the equation for Flame to Flame = 1. Then we change the equation for Gas level to Gas level = 0. Then we change the equation for Flame again, from Flame = 1, to Flame = 2. Finally, we compute that Meat cooked = 2, so the counterfactual comes out false. Unlike the case where Flame = 1 and Flame = 2 are conjoined in the antecedent, the two different assignments for Flame do not generate an impossible antecedent. In this case, the interventions are performed in a specified order: Flame is first set to 1, and then set to 2.

The differences between structural counterfactuals and Stalnaker-Lewis counterfactuals stem from the following two features of structural counterfactuals:

  1. The antecedent of a counterfactual is always thought of as being realized by an intervention, even if the antecedent is already true in a given world. For instance, in our gas grill world, Flame = 3. Nonetheless, if we evaluate a counterfactual with antecedent Flame = 3 in this world, we replace the equation for Flame with Flame = 3.
  2. The truth values of counterfactuals are determined solely by the causal structures of worlds, together with the interventions specified in their antecedents. No further considerations of similarity play a role. For example, the counterfactual

    \[ ({\textit{Flame} = 1} \lor {\textit{Flame} = 2}) \boxright {\textit{Flame} = 2} \]

    would be false in our gas grill world (and indeed in all possible worlds). We do not reason that a world in which Flame = 2 is closer to our world (in which Flame = 3) than a world in which Flame = 1.

These features of structural counterfactuals lead to some unusual properties in the full language developed by Briggs (2012):

  1. The analog of modus ponens fails for the structural conditional; i.e., from A and \(A \boxright B\) we cannot infer B. For example, in our gas grill world, Flame = 3 and \[ {\textit{Flame} = 3} \boxright {({\textit{Gas level} = 2} \boxright {\textit{Meat cooked} = 3})} \] are both true, but \[{\textit{Gas level} = 2} \boxright {\textit{Meat cooked} = 3} \] is false. To evaluate the last counterfactual, we replace the equation for Gas level with Gas level = 2. This results in Flame = 2 and Meat cooked = 2. To evaluate the prior counterfactual, we first substitute in the equation Flame = 3. Now, the value of Flame no longer depends upon the value of Gas level, so when we substitute in Gas level = 2, we get Meat cooked = 3. Even though the actual value of Flame is 3, we evaluate the counterfactual by intervening on Flame to set it to 3. With this intervention in place, a further intervention on Gas level makes no difference to Flame or Meat cooked.
  2. The substitution of logically equivalent propositions into the antecedent of a counterfactual does not always preserve truth value. For example, \[{\textit{Gas level} = 2} \boxright {\textit{Meat cooked} = 2} \] is true, but \[\begin{align}& ({\textit{Gas level} = 2} \amp ({\textit{Flame} = 2} \lor {\sim}(\textit{Flame} = 2)))\\& \qquad \boxright\, {\textit{Meat cooked} = 2} \end{align}\] is false. The first counterfactual does not require us to intervene on the value of Flame, but the second counterfactual requires us to consider interventions that set Flame to all of its possible values.

To handle the second kind of case, Briggs (2012) defines a relation of exact equivalence among Boolean propositions using the state space semantics of Fine (2012). Within a world, the state that makes a proposition true is the collection of values of variables that contribute to the truth of the proposition. In our example world, the state that makes Gas level = 3 true is the valuation Gas level = 3. By contrast, the state that makes

\[\textit{Gas level} = 3 \amp (\textit{Flame} = 2 \lor {\sim}(\textit{Flame} = 2))\]

true includes both Gas level = 3 and Flame = 3. Propositions are exactly equivalent if they are made true by the same states in all possible worlds. The truth value of a counterfactual is preserved when exactly equivalent propositions are substituted into the antecedent.

Briggs (2012) provides a sound and complete axiomatization for structural counterfactuals in acyclic SEMs. The axioms and inference rules of this system are presented in the Supplement on Briggs’ Axiomatization.

3.3 Actual Causation

Many philosophers and legal theorists have been interested in the relation of actual causation. This concerns the assignment of causal responsibility for some event that occurs, based on how events actually play out. For example, suppose that Billy and Suzy are both holding rocks. Suzy throws her rock at a window, but Billy does not. Suzy’s rock hits the window, which breaks. Then Suzy’s throw was the actual cause of the window breaking.

We can represent this story easily enough with a structural equation model. For variables, we choose:

  • \(B = 1\) if Billy throws, 0 if he doesn’t
  • \(S = 1\) if Suzy throws, 0 if she doesn’t
  • \(W = 1\) if the window shatters, 0 if it doesn’t

Our context and equation will then be:

  • \(B = 0\)
  • \(S = 1\)
  • \(W = \max(B, S)\)

The equation for W tells us that the window would shatter if either Billy or Suzy throws their rock. The corresponding DAG is shown in Figure 4.

diagram: B has an arrow pointing northeast to W and S has an arrow pointing northwest to the same W

Figure 4

But we cannot simply read off the relation of actual causation from the graph or from the equations. For example, the arrow from B to W in Figure 4 cannot be interpreted as saying that Billy’s (in)action is an actual cause of the window breaking. Note that while it is common to distinguish between singular or token causation, and general or type-level causation (see, e.g., Eells 1991, Introduction), that is not what is at issue here. Our causal model does not represent any kind of causal generalization: it represents the actual and possible actions of Billy and Suzy at one particular place and time. Actual causation is not just the causal structure of the single case. A further criterion for actual causation, defined in terms of the causal structure together with the actual values of the variables, is needed.

Following Lewis (1973a), it is natural to try to analyze the relation of actual causation in terms of counterfactual dependence. In our model, the following propositions are all true:

  • \(S = 1\)
  • \(W = 1\)
  • \({S = 0}\boxright {W = 0}\)

In words: Suzy threw her rock, the window shattered, and if Suzy hadn’t thrown her rock, the window wouldn’t have shattered. In general, we might attempt to analyze actual causation as follows:

\(X = x\) is an actual cause of \(Y = y\) in world w just in case:

  • X and Y are different variables
  • \(X = x\) and \(Y = y\) in w
  • There exist \(x' \ne x\) and \(y' \ne y\) such that \({X = x'} \boxright {Y = y'}\) is true in w
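This simple analysis can be checked mechanically against the model. Below is a minimal Python sketch (illustrative code of our own devising, not from the causal-modeling literature): the structural equation is evaluated from a context, and a counterfactual antecedent is modeled by overriding the context.

```python
# Minimal sketch of the Billy/Suzy model (illustrative names only).
# Context: B = 0, S = 1; structural equation: W = max(B, S).

def solve(context):
    """Evaluate the model's structural equation given a context
    assigning values to the exogenous variables B and S."""
    values = dict(context)
    values["W"] = max(values["B"], values["S"])
    return values

actual = solve({"B": 0, "S": 1})
counterfactual = solve({"B": 0, "S": 0})   # antecedent S = 0 imposed

print(actual["S"], actual["W"])   # 1 1: S = 1 and W = 1 both hold
print(counterfactual["W"])        # 0: if Suzy hadn't thrown, W = 0
```

With \(x' = 0\) and \(y' = 0\), all three conditions of the simple analysis come out true, so it correctly counts Suzy’s throw as an actual cause in this story.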

Unfortunately, this simple analysis will not work, for familiar reasons involving preemption and overdetermination. Here is an illustration of each:

Preemption: Billy decides that he will give Suzy the opportunity to throw first. If Suzy throws her rock, he will not throw, but if she doesn’t throw, he will throw and his rock will shatter the window. In fact, Suzy throws her rock, which shatters the window. Billy does not throw.

Overdetermination: Billy and Suzy throw their rocks simultaneously. Their rocks hit the window at the same time, shattering it. Either rock by itself would have been sufficient to shatter the window.

In both of these cases, Suzy’s throw is an actual cause of the window’s shattering, but the shattering does not counterfactually depend upon her throw: if Suzy hadn’t thrown her rock, Billy’s rock would have shattered the window. Much of the work on counterfactual theories of causation since 1973 has been devoted to addressing these problems.

A number of authors have used SEMs to try to formulate adequate analyses of actual causation in terms of counterfactuals, including Beckers & Vennekens (2018), Glymour & Wimberly (2007), Halpern (2016), Halpern & Pearl (2001, 2005), Hitchcock (2001), Pearl (2009: Chapter 10), Weslake (forthcoming), and Woodward (2003: Chapter 2). As an illustration, consider one analysis based closely on a proposal presented in Halpern (2016):

(AC) \(X = x\) is an actual cause of \(Y = y\) in world w just in case:

  1. X and Y are different variables
  2. \(X = x\) and \(Y = y\) in w
  3. There exist disjoint sets of variables \(\bX\) and \(\bZ\) with \(X \in \bX\), with values \(\bX = \bx\) and \(\bZ = \bz\) in w, such that:
    1. There exists \(\bx' \ne \bx\) such that \[({\bX = \bx'} \amp {\bZ = \bz}) \boxright {Y \ne y}\] is true in w
    2. No proper subset of \(\bX\) satisfies (3a)

That is, X belongs to a minimal set of variables \(\bX\), such that when we intervene to hold the variables in \(\bZ\) fixed at the values they actually take in w, Y counterfactually depends upon the values of the variables in \(\bX\). We will illustrate this account with our examples of preemption and overdetermination.

In Preemption, let the variables B, S, and W be defined as above. Our context and equations are:

  • \(S = 1\)
  • \(B = 1 - S\)
  • \(W = \max(B, S)\)

That is: Suzy throws her rock; Billy will throw his rock if Suzy doesn’t; and the window will shatter if either throws their rock. The DAG is shown in Figure 5.

diagram: B has an arrow pointing northeast to W and S has an arrow pointing northwest to the same W. S also has an arrow pointing west to the same B.

Figure 5

We want to show that \(S = 1\) is an actual cause of \(W = 1\). Conditions AC(1) and AC(2) are clearly satisfied. For condition AC(3), we choose \(\bX = \{S\}\) and \(\bZ = \{B\}\). Since \(B = 0\) in Preemption, we want to fix \(B = 0\) while varying S. We can see easily that \({S = 0} \amp {B = 0} \boxright {W = 0}\): replacing the two equations for B and S with \(B = 0\) and \(S = 0\), the solution yields \(W = 0\). In words, this counterfactual says that if neither Billy nor Suzy had thrown their rock, the window would not have shattered. Thus condition AC(3a) is satisfied. AC(3b) is satisfied trivially, since \(\bX = \{S\}\) is a singleton set.

Here is how AC works in this example. S influences W along two different paths: the direct path \(S \rightarrow W\) and the indirect path \(S \rightarrow B \rightarrow W\). These two paths interact in such a way that they cancel each other out, and the value of S makes no net difference to the value of W. However, by holding B fixed at its actual value of 0, we eliminate the influence of S on W along that path. The result is that we isolate the contribution that S made to W along the direct path. AC defines actual causation as a particular kind of path-specific effect.
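The preemption computation can be made mechanical with a small Python sketch (illustrative, function names our own): intervening on S alone leaves W unchanged, but freezing B at its actual value 0 recovers the counterfactual dependence required by AC(3a).

```python
# Preemption sketch: context S = 1; equations B = 1 - S, W = max(B, S).

def solve(context, do=None):
    """Evaluate the model; `do` maps intervened variables to the
    values an intervention forces on them, overriding equations."""
    do = do or {}
    v = {}
    v["S"] = do.get("S", context["S"])
    v["B"] = do.get("B", 1 - v["S"])
    v["W"] = do.get("W", max(v["B"], v["S"]))
    return v

print(solve({"S": 1}))                         # actually: S=1, B=0, W=1
print(solve({"S": 1}, {"S": 0})["W"])          # 1: no simple dependence
print(solve({"S": 1}, {"S": 0, "B": 0})["W"])  # 0: AC(3a) is satisfied
```

The second line shows why the simple counterfactual analysis fails here (setting S to 0 makes Billy throw), while the third isolates the direct path by holding B fixed.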

To treat Overdetermination, let B, S, and W keep the same meanings. Our context and equations will be:

  • \(B = 1\)
  • \(S = 1\)
  • \(W = \max(B, S)\)

The graph is the same as that shown in Figure 4 above. Again, we want to show that \(S = 1\) is an actual cause of \(W = 1\). Conditions AC(1) and AC(2) are obviously satisfied. For AC(3), we choose \(\bX = \{B, S\}\) and \(\bZ = \varnothing\). For condition AC(3a), we choose for our alternative setting \(\bX = \bx'\) the values \(B = 0\) and \(S = 0\). Once again, the counterfactual \({S = 0} \amp {B = 0} \boxright {W = 0}\) is true. Now, for AC(3b) we must show that \(\bX = \{B, S\}\) is minimal. It is easy to check that \(\{B\}\) alone won’t satisfy AC(3a). Whether we take \(\bZ = \varnothing\) or \(\bZ = \{S\}\), changing B to 0 (perhaps while also setting S to 1) will not change the value of W. A parallel argument shows that \(\{S\}\) alone won’t satisfy AC(3a) either. The key idea here is that S is a member of a minimal set of variables that need to be changed in order to change the value of W.
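The minimality check for Overdetermination can also be made explicit with a small illustrative sketch: no intervention on B alone or S alone changes W, but the joint intervention does.

```python
# Overdetermination sketch: context B = 1, S = 1; equation W = max(B, S).

def w(b, s):
    return max(b, s)

assert w(1, 1) == 1              # the actual case: the window shatters

# Singleton interventions never change W (the other thrower suffices):
print(w(0, 1), w(1, 0))          # 1 1: AC(3a) fails for {B} and for {S}

# The joint intervention on {B, S} does change W:
print(w(0, 0))                   # 0: AC(3a) holds for {B, S}
```

So \(\{B, S\}\) is a minimal set whose values W counterfactually depends on, and \(S = 1\) counts as an actual cause of \(W = 1\) under AC.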

Despite these successes, none of the analyses of actual causation developed so far perfectly captures our pre-theoretic intuitions in every case. One strategy that has been pursued by a number of authors is to incorporate some distinction between default and deviant values of variables, or between normal and abnormal conditions. See, e.g., Hall (2007), Halpern (2008; 2016: Chapter 3), Halpern & Hitchcock (2015), Hitchcock (2007), and Menzies (2004). Blanchard & Schaffer (2017) present arguments against this approach. Glymour et al. (2010) raise a number of problems for the project of trying to analyze actual causation.

4. Probabilistic Causal Models

In this section, we will discuss causal models that incorporate probability in some way. Probability may be used to represent our uncertainty about the value of unobserved variables in a particular case, or the distribution of variable values in a population. Often we are interested in when some feature of the causal structure of a system can be identified from the probability distribution over values of variables, perhaps in conjunction with background assumptions and other observations. For example, we may know the probability distribution over a set of variables \(\bV\), and want to know which causal structures over the variables in \(\bV\) are compatible with the distribution. In realistic scientific cases, we never directly observe the true probability distribution P over a set of variables. Rather, we observe finite data that approximate the true probability when sample sizes are large enough and observation protocols are well-designed. We will not address these important practical concerns here. Rather, our focus will be on what it is possible to infer from probabilities, in principle if not in practice. We will also consider the application of probabilistic causal models to decision theory and counterfactuals.

4.1 Structural Equation Models with Random Errors

We can introduce probability into a SEM by means of a probability distribution over the exogenous variables.

Let \(\bV = \{X_1, X_2 ,\ldots ,X_n\}\) be a set of endogenous variables, and \(\bU = \{U_1, U_2 ,\ldots ,U_n\}\) a corresponding set of exogenous variables. Suppose that each endogenous variable \(X_i\) is a function of its parents in \(\bV\) together with \(U_i\), that is:

\[X_i = f_i (\bPA(X_i), U_i).\]

As a general rule, our graphical representation of this SEM will include only the endogenous variables \(\bV\), and we use \(\bPA(X_i)\) to denote the set of endogenous parents of \(X_i\). \(U_i\) is sometimes called an error variable for \(X_i\): it is responsible for any difference between the actual value of \(X_i\) and the value predicted on the basis of \(\bPA(X_i)\) alone. We may think of \(U_i\) as encapsulating all of the causes of \(X_i\) that are not included in \(\bV\). The assumption that each endogenous variable has exactly one error variable is innocuous. If necessary, \(U_i\) can be a vector of variables. For example, if \(Y_1\), \(Y_2\), and \(Y_3\) are all causes of \(X_i\) that are not included in \(\bV\), we can let \(U_i = \langle Y_1, Y_2, Y_3\rangle\). Moreover, the error variables need not be distinct or independent from one another.

Assuming that the system of equations is acyclic, an assignment of values to the exogenous variables \(U_1\), \(U_2\),…, \(U_n\) uniquely determines the values of all the variables in the model. Then, if we have a probability distribution \(\Pr'\) over the values of variables in \(\bU\), this will induce a unique probability distribution P on \(\bV\).
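The induced distribution can be computed by brute-force enumeration. The sketch below uses toy equations of our own (\(X_1 = U_1\), \(X_2 = X_1 \cdot U_2\)) and pushes a distribution over independent exogenous variables through them:

```python
from itertools import product

# Sketch: a distribution Pr' over independent exogenous U1, U2 induces
# a distribution P over endogenous X1, X2 (toy equations of our own:
# X1 = U1 and X2 = X1 * U2).

pr_u1 = {0: 0.5, 1: 0.5}
pr_u2 = {0: 0.25, 1: 0.75}

p = {}  # induced joint distribution P over (X1, X2)
for (u1, q1), (u2, q2) in product(pr_u1.items(), pr_u2.items()):
    x1 = u1            # X1 = f1(U1)
    x2 = x1 * u2       # X2 = f2(X1, U2)
    # independence of the errors lets us multiply their probabilities
    p[(x1, x2)] = p.get((x1, x2), 0.0) + q1 * q2

print(p)  # P(X1=0, X2=0) = 0.5, since X2 = 0 whenever X1 = 0
```

Since the equations are deterministic, all of the probability in P comes from \(\Pr'\): exogenous settings that map to the same endogenous values simply pool their weights.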

4.2 The Markov Condition

Suppose we have a SEM with endogenous variables \(\bV\), exogenous variables \(\bU\), probability distribution P on \(\bU\) and \(\bV\) as described in the previous section, and DAG \(\bG\) representing the causal structure on \(\bV\). Pearl and Verma (1991) prove that if the error variables \(U_i\) are probabilistically independent in P, then the probability distribution on \(\bV\) will satisfy the Markov Condition (MC) with respect to \(\bG\). The Markov Condition has several formulations, which are equivalent when \(\bG\) is a DAG (Pearl 1988):

(MCScreening_off) For every variable X in \(\bV\), and every set of variables \(\bY \subseteq \bV \setminus \bDE(X)\), \(\Pr(X \mid \bPA(X) \amp \bY) = \Pr(X \mid \bPA(X))\).
(MCFactorization) Let \(\bV = \{X_1, X_2 , \ldots ,X_n\}\). Then \(\Pr(X_1, X_2 , \ldots ,X_n) = \prod_i \Pr(X_i \mid \bPA(X_i))\).
(MCd-separation) Let \(X, Y \in \bV\), \(\bZ \subseteq \bV \setminus \{X, Y\}\). Then \(\Pr(X, Y \mid \bZ) = \Pr(X \mid \bZ) \times \Pr(Y \mid \bZ)\) if \(\bZ\) d-separates X and Y in \(\bG\) (explained below).

Let us take some time to explain each of these formulations.

MCScreening_off says that the parents of variable X screen X off from all other variables, except for the descendants of X. Given the values of the variables that are parents of X, the values of the variables in \(\bY\) (which includes no descendants of \(X\)) make no further difference to the probability that X will take on any given value.

MCFactorization tells us that once we know the conditional probability distribution of each variable given its parents, \(\Pr(X_i \mid \bPA(X_i))\), we can compute the complete joint distribution over all of the variables. It is relatively easy to see that MCFactorization follows from MCScreening_off. Since \(\bG\) is acyclic, we may re-label the subscripts on the variables so that they are ordered from ‘earlier’ to ‘later’, with only earlier variables being ancestors of later ones. It follows from the probability calculus that \[\begin{align}\Pr(X_1, X_2 , &\ldots ,X_n)\ = \\&\Pr(X_1) \times \Pr(X_2 \mid X_1) \times \ldots \times \Pr(X_n \mid X_1, X_2 , \ldots ,X_{n-1})\end{align}\] (this is a version of the chain rule of probability). For each term \(\Pr(X_i \mid X_1, X_2 , \ldots ,X_{i-1})\), our ordering ensures that all of the parents of \(X_i\) will be included on the right hand side, and none of its descendants will. MCScreening_off then tells us that we can eliminate all of the terms from the right hand side except for the parents of \(X_i\).
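The factorization can be verified numerically. The sketch below (toy numbers of our own) builds the joint distribution for a chain \(X \rightarrow Y \rightarrow Z\) from the parent-conditionals and confirms the screening-off instance \(\Pr(Z \mid X, Y) = \Pr(Z \mid Y)\):

```python
from itertools import product

# MC_Factorization sketch for the chain X -> Y -> Z, with toy
# parent-conditionals of our own: Pr(X), Pr(Y|X), Pr(Z|Y).

px = {0: 0.6, 1: 0.4}
py_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # py_x[x][y]
pz_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # pz_y[y][z]

# The joint distribution is the product of the parent-conditionals:
joint = {(x, y, z): px[x] * py_x[x][y] * pz_y[y][z]
         for x, y, z in product((0, 1), repeat=3)}

def pr(pred):
    """Probability of the event picked out by pred(x, y, z)."""
    return sum(q for v, q in joint.items() if pred(*v))

# MC_Screening_off check: Pr(Z=1 | X=x, Y=1) equals Pr(Z=1 | Y=1) = 0.5
# for both values of x, i.e. Y (Z's only parent) screens X off from Z.
for x in (0, 1):
    num = pr(lambda a, b, c: a == x and b == 1 and c == 1)
    den = pr(lambda a, b, c: a == x and b == 1)
    print(x, num / den)   # prints 0.5 for both values of x
```

The same `joint` dictionary also recovers the chain-rule terms: weeding X from \(\Pr(Z \mid X, Y)\) leaves exactly the factor \(\Pr(Z \mid Y)\) used to build the joint.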

MCd-separation introduces the graphical notion of d-separation. As noted above, a path from X to Y is a sequence of variables \(\langle X = X_1 , \ldots ,X_k = Y\rangle\) such that for each \(X_i\), \(X_{i+1}\), there is either an arrow from \(X_i\) to \(X_{i+1}\) or an arrow from \(X_{i+1}\) to \(X_i\) in \(\bG\). A variable \(X_i\), \(1 \lt i \lt k\), is a collider on the path just in case there is an arrow from \(X_{i-1}\) to \(X_i\) and from \(X_{i+1}\) to \(X_i\). In other words, \(X_i\) is a collider just in case the arrows converge on \(X_i\) in the path. Let \(\bX\), \(\bY\), and \(\bZ\) be disjoint subsets of \(\bV\). \(\bZ\) d-separates \(\bX\) and \(\bY\) just in case every path \(\langle X_1 , \ldots ,X_k\rangle\) from a variable in \(\bX\) to a variable in \(\bY\) contains at least one variable \(X_i\) such that either: (i) \(X_i\) is a collider, and no descendant of \(X_i\) (including \(X_i\) itself) is in \(\bZ\); or (ii) \(X_i\) is not a collider, and \(X_i\) is in \(\bZ\). Any path that meets this condition is said to be blocked by \(\bZ\). If \(\bZ\) does not d-separate \(\bX\) and \(\bY\), then \(\bX\) and \(\bY\) are d-connected by \(\bZ\).
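The definition of d-separation just given is mechanical enough to implement directly. Below is an illustrative (unoptimized) Python sketch, written by us for clarity, that enumerates paths between two single variables and applies clauses (i) and (ii):

```python
# A d-separation sketch implementing the definition above, for single
# variables x and y and a conditioning set z (illustrative only).

def descendants(g, x):
    """All nodes reachable from x by directed edges, including x itself."""
    seen, stack = {x}, [x]
    while stack:
        for child in g.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def paths(g, x, y):
    """All undirected paths from x to y in the DAG g (node -> children)."""
    nbrs = {}
    for a in g:
        for b in g[a]:
            nbrs.setdefault(a, set()).add(b)
            nbrs.setdefault(b, set()).add(a)
    def walk(path):
        if path[-1] == y:
            yield path
            return
        for n in nbrs.get(path[-1], ()):
            if n not in path:
                yield from walk(path + [n])
    yield from walk([x])

def d_separated(g, x, y, z):
    edges = {(a, b) for a in g for b in g[a]}
    for p in paths(g, x, y):
        blocked = False
        for i in range(1, len(p) - 1):
            collider = (p[i-1], p[i]) in edges and (p[i+1], p[i]) in edges
            if collider:                 # clause (i)
                if not (descendants(g, p[i]) & set(z)):
                    blocked = True
            elif p[i] in z:              # clause (ii)
                blocked = True
        if not blocked:
            return False                 # an unblocked path d-connects
    return True

# Figure 7: X -> Y <- Z.  X and Z are d-separated by the empty set,
# but conditioning on the collider Y d-connects them.
g = {"X": ["Y"], "Z": ["Y"], "Y": []}
print(d_separated(g, "X", "Z", set()))   # True
print(d_separated(g, "X", "Z", {"Y"}))   # False
```

For a chain such as \(X \rightarrow Y \rightarrow Z\) the same function gives the reverse verdicts: the empty set d-connects X and Z, while \(\{Y\}\) d-separates them.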

Note that MC provides sufficient conditions for variables to be probabilistically independent, conditional on others, but no necessary condition.

Here are some illustrations:

diagram: T has an arrow pointing north to W; W has a long arrow pointing northeast to Z; W also has an arrow pointing northwest to X; X has an arrow pointing northwest to Y

Figure 6

In Figure 6, MC implies that X screens Y off from all of the other variables, and W screens Z off from all of the other variables. This is most easily seen from MCScreening_off. W also screens T off from all of the other variables, which is most easily seen from MCd-separation. T does not necessarily screen Y off from Z (or indeed anything from anything).

diagram: X has an arrow pointing northeast to Y and Z has an arrow pointing northwest to the same Y.

Figure 7

In Figure 7, MC entails that X and Z will be unconditionally independent, but not that they will be independent conditional on Y. This is most easily seen from MCd-separation.
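A toy parameterization of our own makes Figure 7 concrete: let X and Z be independent fair coins and let Y be their exclusive-or. Then X and Z are unconditionally independent, but conditioning on the collider Y creates dependence:

```python
from itertools import product

# Numerical illustration of Figure 7 (toy parameterization of our own):
# X and Z are independent fair coins, and Y = X XOR Z.

joint = {}
for x, z in product((0, 1), repeat=2):
    joint[(x, x ^ z, z)] = 0.25   # keys are (x, y, z)

def pr(pred):
    return sum(q for v, q in joint.items() if pred(*v))

# Unconditionally, X and Z are independent:
print(pr(lambda x, y, z: x == 1 and z == 1))   # 0.25 = 0.5 * 0.5

# Conditional on Y = 1, they are perfectly anticorrelated:
print(pr(lambda x, y, z: x == 1 and z == 1 and y == 1)
      / pr(lambda x, y, z: y == 1))            # 0.0
```

Learning Y and then learning Z settles the value of X completely, even though Z alone tells us nothing about X.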

Let \(V_i\) and \(V_j\) be two distinct variables in \(\bV\), with corresponding exogenous error variables \(U_i\) and \(U_j\), representing causes of \(V_i\) and \(V_j\) that are excluded from \(\bV\). Suppose \(V_i\) and \(V_j\) share at least one common cause that is excluded from \(\bV\). In this case, we would not expect \(U_i\) and \(U_j\) to be probabilistically independent, and the theorem of Pearl and Verma (1991) would not apply. In this case, the causal relationship among the variables in \(\bV\) would not be appropriately represented by a DAG, but would require an acyclic directed mixed graph (ADMG) with a double-headed arrow connecting \(V_i\) and \(V_j\). We will discuss this kind of case in more detail in Section 4.6 below.

MC is not expected to hold for arbitrary sets of variables \(\bV\), even when the DAG \(\bG\) accurately represents the causal relations among those variables. For example, MC will typically fail in the following kinds of case:

  1. In an EPR (Einstein-Podolsky-Rosen) set-up, we have two particles prepared in the singlet state. If X represents a spin measurement on one particle, and Y a spin measurement (in the same direction) on the other, then X and Y are perfectly anti-correlated. (One particle will be spin-up just in case the other is spin-down.) The measurements can be conducted sufficiently far away from each other that it is impossible for one outcome to causally influence the other. However, it can be shown that there is no (local) common cause Z that screens off the two measurement outcomes.
  2. The variables in \(\bV\) are not appropriately distinct. For example, suppose that X, Y, and Z are variables that are probabilistically independent and causally unrelated. Now define \(U = X + Y\) and \(W = Y + Z\), and let \(\bV = \{U, W\}\). Then U and W will be probabilistically dependent, even though there is no causal relation between them.
  3. MC may fail if the variables are too coarsely grained. Suppose X, Y, and Z are quantitative variables. Z is a common cause of X and Y, and neither X nor Y causes the other. Suppose we replace Z with a coarser variable \(Z'\), indicating only whether Z is high or low. Then we would not expect \(Z'\) to screen X off from Y. The value of X may well contain information about the value of Z beyond what is given by \(Z'\), and this may affect the probability of Y.

Both SGS (2000) and Pearl (2009) contain statements of a principle called the Causal Markov Condition (CMC). The statements are in fact quite different from one another. In Pearl’s formulation, CMC is just a statement of the mathematical theorem described above: if each variable in \(\bV\) is a deterministic function of its parents in \(\bV\), together with an error term, and the errors are probabilistically independent of each other, then the probability distribution on \(\bV\) will satisfy MC with respect to the DAG \(\bG\) representing the functional dependence relations among the variables in \(\bV\). Pearl interprets this result in the following way: Macroscopic systems, he believes, are deterministic. In practice, however, we never have access to all of the causally relevant variables affecting a macroscopic system. But if we include enough variables in our model so that the excluded variables are probabilistically independent of one another, then our model will satisfy the MC, and we will have a powerful set of analytic tools for studying the system. Thus MC characterizes a point at which we have constructed a useful approximation of the complete system.

In SGS (2000), the CMC has more the status of an empirical posit. If \(\bV\) is a set of macroscopic variables that are well-chosen, meaning that they are free from the sorts of defects described above; \(\bG\) is a DAG representing the causal structure on \(\bV\); and P is the empirical probability distribution resulting from this causal structure; then P can be expected to satisfy MC relative to \(\bG\). They defend this assumption in (at least) two ways:

  1. Empirically, it seems that a great many systems do in fact satisfy MC.
  2. Many of the methods that are in fact used to detect causal relationships tacitly presuppose the MC. In particular, the use of randomized trials presupposes a special case of the MC. Suppose that an experimenter determines randomly which subjects will receive treatment with a drug \((D = 1)\) and which will receive a placebo \((D = 0)\), and that under this regimen, treatment is probabilistically correlated with recovery \((R)\). The effect of randomization is to eliminate all of the parents of D, so MC tells us that if R is not a descendant of D, then R and D should be probabilistically independent. If we do not make this assumption, how can we infer from the experiment that D is a cause of R?

Cartwright (1993, 2007: chapter 8) has argued that MC need not hold for genuinely indeterministic systems. Hausman and Woodward (1999, 2004) attempt to defend MC for indeterministic systems.

A causal model that comprises a DAG and a probability distribution that satisfies MC is called a causal Bayes net.

4.3 The Minimality and Faithfulness Conditions

The MC states a sufficient condition but not a necessary condition for conditional probabilistic independence. As such, the MC by itself can never entail that two variables are conditionally or unconditionally dependent. The Minimality and Faithfulness Conditions are two conditions that give necessary conditions for probabilistic independence. (This is employing the terminology of Spirtes et al. (SGS 2000). Pearl (2009) contains a “Minimality Condition” that is slightly different from the one described here.)

(i) The Minimality Condition. Suppose that the DAG \(\bG\) on variable set \(\bV\) satisfies MC with respect to the probability distribution P. The Minimality Condition asserts that no sub-graph of \(\bG\) over \(\bV\) also satisfies the Markov Condition with respect to P. As an illustration, consider the variable set \(\{X, Y\}\), let there be an arrow from X to Y, and suppose that X and Y are probabilistically independent of each other. This graph would satisfy the MC with respect to P: none of the independence relations mandated by the MC are absent (in fact, the MC mandates no independence relations). But this graph would violate the Minimality Condition with respect to P, since the subgraph that omits the arrow from X to Y would also satisfy the MC. The Minimality Condition implies that if there is an arrow from X to Y, then X makes a probabilistic difference for Y, conditional on the other parents of Y. In other words, if \(\bZ = \bPA(Y) \setminus \{X\}\), there exist \(\bz\), y, x, \(x'\) such that \(\Pr(Y = y \mid X = x \amp \bZ = \bz) \ne \Pr(Y = y \mid X = x' \amp \bZ = \bz)\).

(ii) The Faithfulness Condition. The Faithfulness Condition (FC) is the converse of the Markov Condition: it says that all of the (conditional and unconditional) probabilistic independencies that exist among the variables in \(\bV\) are required by the MC. For example, suppose that \(\bV = \{X, Y, Z\}\). Suppose also that X and Z are unconditionally independent of one another, but dependent, conditional upon Y. (The other two variable pairs are dependent, both conditionally and unconditionally.) The graph shown in Figure 8 does not satisfy FC with respect to this distribution (colloquially, the graph is not faithful to the distribution). MC, when applied to the graph of Figure 8, does not imply the independence of X and Z. This can be seen by noting that X and Z are d-connected (by the empty set): neither the path \(X \rightarrow Z\) nor \(X \rightarrow Y \rightarrow Z\) is blocked (by the empty set). By contrast, the graph shown in Figure 7 above is faithful to the described distribution. Note that Figure 8 does satisfy the Minimality Condition with respect to the distribution; no subgraph satisfies MC with respect to the described distribution. In fact, FC is strictly stronger than the Minimality Condition.

diagram: X has an arrow pointing northeast to Y and another arrow pointing east to Z; Y has an arrow pointing southeast to Z.

Figure 8

Here are some other examples: In Figure 6 above, there is a path \(W \rightarrow X \rightarrow Y\); FC implies that W and Y should be probabilistically dependent. In Figure 7, FC implies that X and Z should be dependent, conditional on Y.

FC can fail if the probabilistic parameters in a causal model are just so. In Figure 8, for example, X influences Z along two different directed paths. If the effect of one path is to exactly undo the influence along the other path, then X and Z will be probabilistically independent. If the underlying SEM is linear, Spirtes et al. (SGS 2000: Theorem 3.2) prove that the set of parameters for which Faithfulness is violated has Lebesgue measure 0. Nonetheless, parameter values leading to violations of FC are possible, so FC does not seem plausible as a metaphysical or conceptual constraint upon the connection between causation and probabilities. It is, rather, a methodological principle: Given a distribution on \(\{X, Y, Z\}\) in which X and Z are independent, we should prefer the causal structure depicted in Figure 7 to the one in Figure 8. This is not because Figure 8 is conclusively ruled out by the distribution, but rather because it is preferable to postulate a causal structure that implies the independence of X and Z rather than one that is merely consistent with independence. See Zhang and Spirtes 2016 for comprehensive discussion of the role of FC.
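The cancellation can be exhibited exactly in a linear SEM for Figure 8. With \(Y = aX + U_Y\) and \(Z = bY + cX + U_Z\) (zero-mean, independent errors), \(\mathrm{Cov}(X, Z) = (ab + c)\mathrm{Var}(X)\), so the measure-zero choice \(c = -ab\) makes X and Z uncorrelated. A quick check with toy coefficients of our own:

```python
# Faithfulness-violation sketch for Figure 8: X -> Y -> Z with path
# strengths a then b, plus a direct path X -> Z with strength c.
# From the equations, Cov(X, Z) = (a*b + c) * Var(X).

a, b = 2.0, 3.0
c = -a * b                        # tuned so the two paths exactly cancel
var_x = 1.0

cov_xy = a * var_x                # from Y = a*X + U_Y
cov_xz = b * cov_xy + c * var_x   # from Z = b*Y + c*X + U_Z
print(cov_xz)                     # 0.0: X and Z are uncorrelated
```

Any perturbation of c away from \(-ab\) restores the dependence, which is why such violating parameter sets have Lebesgue measure zero in the linear case.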

Violations of FC are often detectable in principle. For example, suppose that the true causal structure is that shown in Figure 7, and that the probability distribution over X, Y, and Z exhibits all of the conditional independence relations required by MC. Suppose, moreover, that X and Z are independent, conditional upon Y. This conditional independence relation is not entailed by MC, so it constitutes a violation of FC. It turns out that there is no DAG that is faithful to this probability distribution. This tips us off that there is a violation of FC. While we will not be able to infer the correct causal structure, we will at least avoid inferring an incorrect one in this case. For details, see Steel 2006, Zhang & Spirtes 2008, and Zhang 2013b.

Researchers have explored the consequences of adopting a variety of assumptions that are weaker than FC; see for example Ramsey et al. 2006, Spirtes & Zhang 2014, and Zhalama et al. 2016.

4.4 Identifiability of Causal Structure

If we have a set of variables \(\bV\) and know the probability distribution P on \(\bV\), what can we infer about the causal structure on \(\bV\)? This epistemological question is closely related to the metaphysical question of whether it is possible to reduce causation to probability (as, e.g., Reichenbach 1956 and Suppes 1970 proposed).

Pearl (1988: Chapter 3) proves the following theorem:

(Identifiability with time-order)
If

  1. the variables in \(\bV\) are time-indexed, such that only earlier variables can cause later ones;
  2. the probability P assigns positive probability to every possible assignment of values of the variables in \(\bV\);
  3. there are no latent variables, so that the correct causal graph \(\bG\) is a DAG;
  4. and the probability measure P satisfies the Markov and Minimality Conditions with respect to \(\bG\);

then it will be possible to uniquely identify \(\bG\) on the basis of P.

It is relatively easy to see why this holds. For each variable \(X_i\), its parents must come from among the variables with lower time indices, call them \(X_1 ,\ldots ,X_{i-1}\). Any variables in this group that are not parents of \(X_i\) will be nondescendants of \(X_i\); hence they will be screened off from \(X_i\) by its parents (from MCScreening_off). Thus we can start with the distributions \(\Pr(X_i \mid X_1 ,\ldots ,X_{i-1})\), and then weed out any variables from the right hand side that make no difference to the probability distribution over \(X_i\). By the Minimality Condition, we know that the variables so weeded are not parents of \(X_i\). Those variables that remain are the parents of \(X_i\) in \(\bG\).
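The weeding procedure can be sketched for a toy chain \(X_1 \rightarrow X_2 \rightarrow X_3\) (a parameterization of our own: each variable copies its predecessor, flipping with probability 0.25). A candidate is weeded exactly when it makes no difference to the later variable, conditional on the other earlier variables:

```python
from itertools import product

# Parent identification with a known time order (illustrative sketch).
# Toy chain X1 -> X2 -> X3: X1 is a fair coin, and each later variable
# copies its predecessor, flipping with probability 0.25.

joint = {}
for x1, x2, x3 in product((0, 1), repeat=3):
    p = 0.5
    p *= 0.75 if x2 == x1 else 0.25
    p *= 0.75 if x3 == x2 else 0.25
    joint[(x1, x2, x3)] = p

def cond(i, given):
    """Pr(variable at index i equals 1 | the assignments in `given`)."""
    den = sum(q for v, q in joint.items()
              if all(v[j] == val for j, val in given.items()))
    num = sum(q for v, q in joint.items()
              if v[i] == 1 and all(v[j] == val for j, val in given.items()))
    return num / den

def makes_difference(i, j, others):
    """Does varying index j move Pr(index i) for some setting of the
    remaining earlier variables?"""
    for vals in product((0, 1), repeat=len(others)):
        base = dict(zip(others, vals))
        if abs(cond(i, {**base, j: 0}) - cond(i, {**base, j: 1})) > 1e-9:
            return True
    return False

# Weed the earlier variables (indices 0 and 1) that make no difference
# to X3 (index 2); the survivors are X3's parents.
parents = [j for j in (0, 1)
           if makes_difference(2, j, [k for k in (0, 1) if k != j])]
print(parents)   # [1]: only X2 is a parent; X1 is screened off by X2
```

The positivity assumption (2) matters here: it guarantees that every conditional probability in the weeding test is well-defined.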

If we don’t have information about time ordering, or other substantive assumptions restricting the possible causal structures among the variables in \(\bV\), then it will not always be possible to identify the causal structure from probability alone. In general, given a probability distribution P on \(\bV\), it is only possible to identify a Markov equivalence class of causal structures. This will be the set of all DAGs on \(\bV\) that (together with MC) imply all and only the conditional independence relations contained in P. In other words, it will be the set of all DAGs \(\bG\) such that P satisfies MC and FC with respect to \(\bG\). The PC algorithm described by SGS (2000: 84–85) is one algorithm that generates the Markov equivalence class for any probability distribution with a non-empty Markov equivalence class.

Consider two simple examples involving three variables \(\{X, Y, Z\}\). Suppose our probability distribution has the following properties:

  • X and Y are dependent unconditionally, and conditional on Z
  • Y and Z are dependent unconditionally, and conditional on X
  • X and Z are dependent unconditionally, but independent conditional on Y

Then the Markov equivalence class is:

\[X \rightarrow Y \rightarrow Z\\X \leftarrow Y \leftarrow Z\\X \leftarrow Y \rightarrow Z\]

We cannot determine from the probability distribution, together with MC and FC, which of these structures is correct.

On the other hand, suppose the probability distribution is as follows:

  • X and Y are dependent unconditionally, and conditional on Z
  • Y and Z are dependent unconditionally, and conditional on X
  • X and Z are independent unconditionally, but dependent conditional on Y

Then the Markov equivalence class is:

\[X \rightarrow Y \leftarrow Z\]

This is the only DAG relative to which the given probability distribution satisfies MC and FC.

4.5 Identifiability with Assumptions about Functional Form

Suppose we have a SEM with endogenous variables \(\bV\) and exogenous variables \(\bU\), where each variable in \(\bV\) is determined by an equation of the form:

\[X_i = f_i (\bPA(X_i), U_i).\]

Suppose, moreover, that we have a probability distribution \(\Pr'\) on \(\bU\) in which all of the \(U_i\)s are independent. This will induce a probability distribution P on \(\bV\) that satisfies MC relative to the correct causal DAG on \(\bV\). In other words, our probabilistic SEM will generate a unique causal Bayes net. The methods described in the previous section attempt to infer the underlying graph \(\bG\) from relations of probabilistic dependence and independence. These methods can do no better than identifying the Markov equivalence class. Can we do better by making use of additional information about the probability distribution P, beyond relations of dependence and independence?

There is good news and there is bad news. First the bad news. If the variables in \(\bV\) are discrete, and we make no assumptions about the form of the functions \(f_i\), then we can infer no more about the SEM than the Markov equivalence class to which the graph belongs (Meek 1995).

More bad news: If the variables in \(\bV\) are continuous, the simplest assumption, and the one that has been studied in most detail, is that the equations are linear with Gaussian (normal, or bell-shaped) errors. That is:

  • \(X_i = \sum_j c_j X_j + U_i\), where j ranges over the indices of \(\bPA(X_i)\) and the \(c_j\)s are constants
  • \(\Pr'\) assigns a Gaussian distribution to each \(U_i\)

It turns out that with these assumptions, we can do no better than inferring the Markov equivalence class of the causal graph on \(\bV\) from probabilistic dependence and independence (Geiger & Pearl 1988).

Now for the good news. There are fairly general assumptions that allow us to infer a good deal more. Here are some fairly simple cases:

(LiNGaM) (Shimizu et al. 2006)
If:

  1. The variables in \(\bV\) are continuous;
  2. The functions \(f_i\) are linear;
  3. The probability distributions on the error variables \(U_i\) are not Gaussian (or at most one is Gaussian);
  4. The error variables \(U_i\) are probabilistically independent in \(\Pr'\);

then the correct DAG on \(\bV\) can be uniquely determined by the induced probability distribution P on \(\bV\).
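A small simulation conveys why non-Gaussianity helps (an illustration of the underlying idea only, not the published LiNGaM algorithm). With uniform errors, regressing in the causal direction leaves residuals independent of the regressor, while regressing in the anti-causal direction does not, and the asymmetry shows up in the squared values:

```python
import random

# Why non-Gaussian errors break the X -> Y / Y -> X symmetry: a toy
# illustration of the idea behind LiNGaM (not the published algorithm).
# True model: X uniform on [-1, 1], Y = X + U, U uniform on [-1, 1].

random.seed(0)
n = 100_000
xs = [random.uniform(-1, 1) for _ in range(n)]
us = [random.uniform(-1, 1) for _ in range(n)]
ys = [x + u for x, u in zip(xs, us)]

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return mean([(p - ma) * (q - mb) for p, q in zip(a, b)])

def resid(target, regressor):
    """Residuals of an ordinary least-squares fit of target on regressor."""
    beta = cov(target, regressor) / cov(regressor, regressor)
    return [t - beta * r for t, r in zip(target, regressor)]

def sq_corr(r, reg):
    """Correlation between squared residuals and squared regressor:
    roughly zero here iff residuals and regressor are independent."""
    r2, g2 = [v * v for v in r], [v * v for v in reg]
    return cov(r2, g2) / (cov(r2, r2) * cov(g2, g2)) ** 0.5

forward = sq_corr(resid(ys, xs), xs)    # regress effect on cause
backward = sq_corr(resid(xs, ys), ys)   # regress cause on effect
print(abs(forward) < 0.02, abs(backward) > 0.05)   # True True
```

With Gaussian errors both correlations would vanish and the two directions would be indistinguishable, consistent with the Geiger and Pearl result mentioned above.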

(Non-linear additive) (Hoyer et al. 2009)
Almost all functions of the following form allow the correct DAG on \(\bV\) to be uniquely determined by the induced probability distribution P on \(\bV\):

  1. The functions \(f_i\) are nonlinear and the errors are additive (so \(X_i = f_i (\bPA(X_i)) + U_i\), with \(f_i\) nonlinear);
  2. The error variables \(U_i\) are probabilistically independent in \(\Pr'\);

In fact, this case can be generalized considerably:

(Post non-linear) (Zhang & Hyvärinen 2009)
With the exception of five specific cases that can be fully specified, all functions of the following form allow the correct DAG on \(\bV\) to be uniquely determined by the induced probability distribution P on \(\bV\):
  1. The functions have the form \(X_i = g_i (f_i (\bPA(X_i)) + U_i)\), with \(f_i\) and \(g_i\) nonlinear, and \(g_i\) invertible;
  2. The error variables \(U_i\) are probabilistically independent in \(\Pr'\);

See also Peters et al. (2017) for discussion.

While there are specific assumptions behind these results, they are nonetheless remarkable. They entail, for example, that (given the assumptions of the theorems) knowing only the probability distribution on two variables X and Y, we can infer whether X causes Y or Y causes X.

4.6 Latent Common Causes

The discussion so far has focused on the case where there are no latent common causes of the variables in \(\bV\), and the error variables \(U_i\) can be expected to be probabilistically independent. As we noted in Section 2.3 above, we represent a latent common cause with a double-headed arrow. For example, the acyclic directed mixed graph in Figure 9 represents a latent common cause of X and Z. More generally, we can use an ADMG like Figure 9 to represent that the error variables for X and Z are not probabilistically independent.

diagram: X has an arrow pointing east to Y which in turn has an arrow pointing east to Z; X and Z are connected by a curved double-headed arrow

Figure 9

If there are latent common causes, we expect MC Screening_off and MC Factorization to fail if we apply them in a naïve way. In Figure 9, Y is the only parent of Z shown in the graph, and if we try to apply MC Screening_off, it tells us that Y should screen X off from Z. However, we would expect X and Z to be correlated, even when we condition on Y, due to the latent common cause. The problem is that the graph is missing a relevant parent of Z, namely the omitted common cause. However, suppose that the probability distribution on \(\{L, X, Y, Z\}\) satisfies MC with respect to the DAG that includes L as a common cause of X and Z. Then it turns out that the probability distribution will still satisfy MC d-separation with respect to the ADMG of Figure 9. A causal model incorporating an ADMG and probability distribution satisfying MC d-separation is called a semi-Markov causal model (SMCM).

If we allow that the correct causal graph may be an ADMG, we can still apply MC d-separation, and ask which graphs imply the same sets of conditional independence relations. The Markov equivalence class will be larger than it was when we did not allow for latent variables. For instance, suppose that the probability distribution on \(\{X, Y, Z\}\) has the following features:

  • X and Y are dependent unconditionally, and conditional on Z
  • Y and Z are dependent unconditionally, and conditional on X
  • X and Z are independent unconditionally, but dependent conditional on Y

We saw in Section 4.4 that the only DAG that implies just these (in)dependencies is:

\[X \rightarrow Y \leftarrow Z\]

But if we allow for the possibility of latent common causes, there will be additional ADMGs that also imply just these (in)dependencies. For example, the structure

\[X \leftrightarrow Y \leftrightarrow Z\]

is also in the Markov equivalence class, as are several others.

Latent variables present a further complication. Unlike the case where the error variables \(U_i\) are probabilistically independent, a SEM with correlated error terms may imply probabilistic constraints in addition to conditional (in)dependence relations, even in the absence of further assumptions about functional form. This means that we may be able to rule out some of the ADMGs in the Markov equivalence class using different kinds of probabilistic constraints.

4.7 Interventions

A conditional probability such as \(\Pr(Y = y \mid X = x)\) gives us the probability that Y will take the value y, given that X has been observed to take the value x. Often, however, we are interested in predicting the value of Y that will result if we intervene to set the value of X equal to some particular value x. Pearl (2009) writes \(\Pr(Y = y \mid \ido(X = x))\) to characterize this probability. The notation is misleading, since \(\ido(X = x)\) is not an event in the original probability space. It might be more accurate to write \(\Pr_{\ido(X = x)} (Y = y)\), but we will use Pearl’s notation here. What is the difference between observation and intervention? When we merely observe the value that a variable takes, we are learning about the value of the variable when it is caused in the normal way, as represented in our causal model. Information about the value of the variable will also provide us with information about its causes, and about other effects of those causes. However, when we intervene, we override the normal causal structure, forcing a variable to take a value it might not have taken if the system were left alone. Graphically, we can represent the effect of this intervention by eliminating the arrows directed into the variable intervened upon. Such an intervention is sometimes described as “breaking” those arrows. As we saw in Section 3.1, in the context of a SEM, we represent an intervention that sets X to x by replacing the equation for X with a new one specifying that \(X = x\).

As we saw in Section 3.2, there is a close connection between interventions and counterfactuals; in particular, the antecedents of structural counterfactuals are thought of as being realized by interventions. Nonetheless, Pearl (2009) distinguishes claims about interventions represented by the do operator from counterfactuals. The former are understood in the indicative mood; they concern interventions that are actually performed. Counterfactuals are in the subjunctive mood, and concern hypothetical interventions. This leads to an important epistemological difference between ordinary interventions and counterfactuals: they behave differently in the way that they interact with observations of the values of variables. In the case of interventions, we are concerned with evaluating probabilities such as

\[\Pr(\bY = \by \mid \bX =\bx, \ido(\bZ = \bz)).\]

We assume that the intervention \(\ido(\bZ = \bz)\) is being performed in the actual world, and hence that we are observing the values that other variables take \((\bX = \bx)\) in the same world where the intervention takes place. In the case of counterfactuals, we observe the value of various variables in the actual world, in which there is no intervention. We then ask what would have happened if an intervention had been performed. The variables whose values we observed may well take on different values in the hypothetical world where the intervention takes place. Here is a simple illustration of the difference. Suppose that we have a causal model in which treatment with a drug causes recovery from a disease. There may be other variables and causal relations among them as well.

Intervention:

  • An intervention was performed to treat a particular patient with the drug, and it was observed that she did not recover.
  • Question: What is the probability that she recovered, given the intervention and the observed evidence?
  • Answer: Zero, trivially.

Counterfactual:

  • It was observed that a patient did not recover from the disease.
  • Question: What is the probability that she would have recovered, had she been treated with the drug?
  • Answer: Nontrivial. The answer is not necessarily zero, nor is it necessarily P(recovery | treatment). If we know that she was in fact treated, then we could infer that she would not have recovered if treated. But we do not know whether she was treated. The fact that she did not recover gives us partial information: it makes it less likely that she was in fact treated; it also makes it more likely that she has a weak immune system, and so on. We must make use of all of this information in trying to determine the probability that she would have recovered if treated.

We will discuss interventions in the present section, and counterfactuals in Section 4.10 below.

Suppose that we have an acyclic structural equation model with exogenous variables \(\bU\) and endogenous variables \(\bV\). We have equations of the form

\[X_i = f_i (\bPA(X_i), U_i),\]

and a probability distribution \(\Pr'\) on the exogenous variables \(\bU\). \(\Pr'\) then induces a probability distribution P on \(\bV\). To represent an intervention that sets \(X_k\) to \(x_k\), we replace the equation for \(X_k\) with \(X_k = x_k\). Now \(\Pr'\) induces a new probability distribution P* on \(\bV\) (since settings of the exogenous variables \(\bU\) give rise to different values of the variables in \(\bV\) after the intervention). P* is the new probability distribution that Pearl writes as \(\Pr(• \mid \ido(X_k = x_k))\).

But even if we do not have a complete SEM, we can often compute the effect of interventions. Suppose we have a causal model in which the probability distribution P satisfies MC on the causal DAG \(\bG\) over the variable set \(\bV = \{X_1, X_2 ,\ldots ,X_n\}\). The most useful version of MC for thinking about interventions is MC Factorization (see Section 4.2), which tells us:

\[\Pr(X_1, X_2 , \ldots ,X_n) = \prod_i \Pr(X_i \mid \bPA(X_i)).\]

Now suppose that we intervene by setting the value of \(X_k\) to \(x_k\). The post-intervention probability P* is the result of altering the factorization as follows:

\[\Pr^*(X_1, X_2 , \ldots ,X_n) = \Pr'(X_k) \times \prod_{i \ne k} \Pr(X_i \mid \bPA(X_i)),\]

where \(\Pr'(X_k = x_k) = 1\). The conditional probabilities of the form \(\Pr(X_i \mid \bPA(X_i))\) for \(i \ne k\) remain unchanged by the intervention. This gives the same result as computing the result of an intervention using a SEM, when the latter is available. This result can be generalized to the case where the intervention imposes a probability distribution \(\Pr^{\dagger}\) on some subset of the variables in \(\bV\). For simplicity, let’s re-label the variables so that \(\{X_1, X_2 ,\ldots ,X_k\}\) is the set of variables that we intervene on. Then the post-intervention probability distribution is:

\[\Pr^*(X_1, X_2 , \ldots ,X_n) = \Pr^{\dagger}( X_1, X_2 ,\ldots ,X_k) \times \prod_{k \lt i \le n} \Pr(X_i \mid \bPA(X_i)).\]
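The truncated factorization can be checked by direct enumeration. The following sketch uses a hypothetical binary chain X → Y → Z with made-up conditional probabilities: observing Y = 1 changes the probability of X, but setting Y = 1 by intervention leaves the marginal distribution of X untouched, because the arrow from X into Y is broken.

```python
from itertools import product

# Illustrative binary chain X -> Y -> Z (all numbers are made up).
pX = {0: 0.7, 1: 0.3}
pY_given_X = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # pY_given_X[x][y]
pZ_given_Y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # pZ_given_Y[y][z]

# Markov factorization: Pr(x, y, z) = Pr(x) Pr(y | x) Pr(z | y).
def joint(x, y, z):
    return pX[x] * pY_given_X[x][y] * pZ_given_Y[y][z]

# Truncated factorization for do(Y = 1): the factor for Y is dropped and
# replaced by a point mass at y = 1; the other factors are unchanged.
def joint_do_y1(x, y, z):
    return pX[x] * (1.0 if y == 1 else 0.0) * pZ_given_Y[y][z]

# Observing Y = 1 changes our beliefs about the cause X ...
p_x1_given_y1 = sum(joint(1, 1, z) for z in (0, 1)) / sum(
    joint(x, 1, z) for x, z in product((0, 1), repeat=2))

# ... but intervening on Y does not: Pr*(X) is just the prior Pr(X).
p_x1_do_y1 = sum(joint_do_y1(1, y, z) for y, z in product((0, 1), repeat=2))

print(round(p_x1_given_y1, 3))  # 0.774: conditioning is informative about X
print(round(p_x1_do_y1, 3))     # 0.3: the intervention leaves Pr(X) unchanged
```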

The Manipulation Theorem of SGS (2000: theorem 3.6) generalizes this formula to cover a much broader class of interventions, including ones that don’t break all the arrows into the variables that are intervened on.

Pearl (2009: Chapter 3) develops an axiomatic system he calls the do-calculus for computing post-intervention probabilities that can be applied to systems with latent variables, where the causal structure on \(\bV\) is represented by an ADMG (including double-headed arrows) instead of a DAG. The axioms of this system are presented in the Supplement on the do-calculus. One useful special case is given by the

Back-Door Criterion. Let X and Y be variables in \(\bV\), and \(\bZ \subseteq \bV \setminus \{X, Y\}\) such that:

  1. no member of \(\bZ\) is a descendant of X; and
  2. every path between X and Y that terminates with an arrow into X either (a) includes a non-collider in \(\bZ\), or (b) includes a collider that has no descendants in \(\bZ\);

then \(\Pr(Y \mid \ido(X), \bZ) = \Pr(Y \mid X, \bZ)\).

That is, if we can find an appropriate conditioning set \(\bZ\), the probability resulting from an intervention on X will be the same as the conditional probability corresponding to an observation of X.
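The criterion can be illustrated by enumeration on a hypothetical confounded structure Z → X, Z → Y, X → Y, where \(\{Z\}\) blocks the back-door path X ← Z → Y. All numbers are made up for the sketch.

```python
# Confounded structure: Z -> X, Z -> Y, X -> Y; {Z} satisfies the
# back-door criterion for (X, Y). Illustrative binary distributions.
pZ = {0: 0.5, 1: 0.5}
pX_given_Z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}            # pX_given_Z[z][x]
pY1_given_XZ = {(0, 0): 0.1, (1, 0): 0.4, (0, 1): 0.6, (1, 1): 0.9}  # Pr(Y=1 | x, z)

# Ordinary conditioning: Pr(Y=1 | X=1) mixes the effect of X with the
# back-door association through the confounder Z.
p_x1 = sum(pZ[z] * pX_given_Z[z][1] for z in (0, 1))
p_y1_given_x1 = sum(pZ[z] * pX_given_Z[z][1] * pY1_given_XZ[(1, z)]
                    for z in (0, 1)) / p_x1

# Back-door adjustment: Pr(Y=1 | do(X=1)) = sum_z Pr(z) Pr(Y=1 | X=1, z),
# using the *marginal* distribution of Z, not its distribution given X=1.
p_y1_do_x1 = sum(pZ[z] * pY1_given_XZ[(1, z)] for z in (0, 1))

print(round(p_y1_given_x1, 3))  # 0.844: inflated by the confounder
print(round(p_y1_do_x1, 2))     # 0.65: the effect of setting X = 1
```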

4.8 Interventionist Decision Theory

Evidential Decision Theory of the sort developed by Jeffrey (1983) runs into well-known problems in variants of Newcomb’s problem (Nozick 1969). For example, suppose Cheryl believes the following: She periodically suffers from a potassium deficiency. This state produces two effects with high probability: It causes her to eat bananas, which she enjoys; and it causes her to suffer debilitating migraines. On days when she suffers from the potassium deficiency, she has no introspective access to this state. In particular, she is not aware of any banana cravings. Perhaps she rushes to work every morning, grabbing whatever is at hand to eat on her commute. Cheryl’s causal model is represented by the DAG in Figure 10.

diagram with K having an arrow pointing northwest to B and northeast to M

Figure 10

\(K = 1\) represents potassium deficiency, \(B = 1\) eating a banana, and \(M = 1\) migraine. Her probabilities are as follows:

\[\begin{aligned}\Pr(K = 1) & = .2\\\Pr(B = 1 \mid K = 1) & = .9, &\Pr(B = 1 \mid K = 0) & = .1\\\Pr(M = 1 \mid K = 1) & = .9, & \Pr(M = 1 \mid K = 0) & = .1\end{aligned}\]

Her utility for the state of the world \(w \equiv \{K = k, B = b, M = m\}\) is \(\Ur(w) = b - 20m\). That is, she gains one unit of utility for eating a banana, but loses 20 units for suffering a migraine. She assigns no intrinsic value to the potassium deficiency.

Cheryl is about to leave for work. Should she eat a banana? According to Evidential Decision Theory (EDT), Cheryl should maximize Evidential Expected Utility, where

\[\EEU(B = b) = \sum_w \Pr(w \mid B = b)\Ur(w)\]

From the probabilities given, we can compute that:

\[\begin{aligned}\Pr(M = 1 \mid B = 1) & \approx .65\\\Pr(M = 1 \mid B = 0) & \approx .12\end{aligned}\]

Eating a banana is strongly correlated with migraine, due to the common cause. Thus

\[\begin{aligned}\EEU(B = 1) &\approx {-12}\\\EEU(B = 0) & \approx {-2.4}\end{aligned}\]

So EDT, at least in its simplest form, recommends abstaining from bananas. Although Cheryl enjoys them, they provide strong evidence that she will suffer from a migraine.

Many think that this is bad advice. Eating a banana does not cause Cheryl to get a migraine; it is a harmless pleasure. A number of authors have formulated versions of Causal Decision Theory (CDT) that aim to incorporate explicitly causal considerations (e.g., Gibbard & Harper 1978; Joyce 1999; Lewis 1981; Skyrms 1980). Causal models provide a natural setting for CDT, an idea proposed by Meek and Glymour (1994) and developed by Hitchcock (2016), Pearl (2009: Chapter 4), and Stern (2017). The central idea is that the agent should treat her action as an intervention. This means that Cheryl should maximize her Causal Expected Utility:

\[\CEU(B = b) = \sum_w \Pr(w \mid \ido(B = b))\Ur(w)\]

Now we can compute

\[\begin{aligned}\Pr(M = 1 \mid \ido(B = 1)) & = .26\\\Pr(M = 1 \mid \ido(B = 0)) & = .26\end{aligned}\]

So that now

\[\begin{aligned}\CEU(B = 1) &= {-4.2}\\\CEU(B = 0) & = {-5.2}\end{aligned}\]

This yields the plausible result that eating a banana gives Cheryl a free unit of utility. By intervening, Cheryl breaks the arrow from K to B and destroys the correlation between eating a banana and suffering a migraine.
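The expected utilities above can be reproduced by enumerating Cheryl’s model directly; the sketch below uses the probabilities and utility function given in the text (the helper names are ours).

```python
# Cheryl's model: K -> B and K -> M, with Pr(K=1)=.2,
# Pr(B=1|K=1)=Pr(M=1|K=1)=.9, Pr(B=1|K=0)=Pr(M=1|K=0)=.1,
# and utility U(w) = b - 20m.
pK = {0: 0.8, 1: 0.2}
pB_given_K = {0: 0.1, 1: 0.9}   # Pr(B=1 | k)
pM_given_K = {0: 0.1, 1: 0.9}   # Pr(M=1 | k)

def p_world(k, b, m):
    pb = pB_given_K[k] if b else 1 - pB_given_K[k]
    pm = pM_given_K[k] if m else 1 - pM_given_K[k]
    return pK[k] * pb * pm

def utility(b, m):
    return b - 20 * m

def eeu(b):
    """Evidential expected utility: condition on the act B = b."""
    worlds = [(k, m) for k in (0, 1) for m in (0, 1)]
    pb = sum(p_world(k, b, m) for k, m in worlds)
    return sum(p_world(k, b, m) / pb * utility(b, m) for k, m in worlds)

def ceu(b):
    """Causal expected utility: do(B = b) breaks the arrow from K to B,
    so the factor for B is dropped and K keeps its prior distribution."""
    return sum(pK[k] * (pM_given_K[k] if m else 1 - pM_given_K[k]) * utility(b, m)
               for k in (0, 1) for m in (0, 1))

print(round(eeu(1), 2), round(eeu(0), 2))  # -12.08 -2.43: EDT says abstain
print(round(ceu(1), 2), round(ceu(0), 2))  # -4.2 -5.2: CDT says eat the banana
```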

More generally, one can use the methods for calculating the effects of interventions described in the previous section to compute the probabilities needed to calculate Causal Expected Utility. Stern (2017) expands this approach to allow for agents who distribute their credence over multiple causal models. Hitchcock (2016) shows how the distinction between interventions and counterfactuals, discussed in more detail in Section 4.10 below, can be used to deflect a number of alleged counterexamples to CDT.

There is much more that can be said about the debate between EDT and CDT. For instance, if Cheryl knows that she is intervening, then she will not believe herself to be accurately described by the causal structure in Figure 10. Instead, she will believe herself to instantiate a causal structure in which the arrow from K to B is removed. In this causal structure, if P satisfies MC, we will have \(\Pr(w \mid B = b) = \Pr(w \mid \ido(B = b))\), and the difference between EDT and CDT collapses. If there is a principled reason why a deliberating agent will always believe herself to be intervening, then EDT will yield the same normative recommendations as CDT, and will avoid counterexamples like the one described above. Price’s defense of EDT (Price 1986) might be plausibly reconstructed along these lines. So the moral is not necessarily that CDT is normatively correct, but rather that causal models may be fruitfully employed to clarify issues in decision theory connected with causation.

4.9 Causal Discovery with Interventions

In the previous section, we discussed how to use knowledge (or assumptions) about the structure of a causal graph \(\bG\) to make inferences about the results of interventions. In this section, we explore the converse problem. If we can intervene on variables and observe the post-intervention probability distribution, what can we infer about the underlying causal structure? This topic has been explored extensively in the work of Eberhardt and his collaborators. (See, for example, Eberhardt & Scheines 2007 and Hyttinen et al. 2013a.) Unsurprisingly, we can learn more about causal structure if we can perform interventions than if we can only make passive observations. However, just how much we can infer depends upon what kinds of interventions we can perform, and on what background assumptions we make.

If there are no latent common causes, so that the true causal structure on \(\bV\) is represented by a DAG \(\bG\), then it will always be possible to discover the complete causal structure using interventions. If we can only intervene on one variable at a time, we may need to separately intervene on all but one of the variables before the causal structure is uniquely identified. If we can intervene on multiple variables at the same time, we can discover the true causal structure more quickly.

If there are latent common causes, so that the true causal structure on \(\bV\) is represented by an ADMG, then it may not be possible to discover the true causal structure using only single-variable interventions. (Although we can do this in the special case where the functions in the underlying structural equation model are all linear.) However, if we can intervene on multiple variables at the same time, then it is possible to discover the true causal graph.

Eberhardt and collaborators have also explored causal discovery using soft interventions. A soft intervention influences the value of a variable without breaking the arrows into that variable. For instance, suppose we want to know whether increasing the income of parolees will lead to decreased recidivism. We randomly divide subjects into treatment and control conditions, and give regular cash payments to those in the treatment condition. This is not an intervention on income per se, since income will still be influenced by the usual factors: savings and investments, job training, help from family members, and so on. Soft interventions facilitate causal inference because they create colliders, and as we have seen, colliders have a distinct probabilistic signature. Counterintuitively, this means that if we want to determine whether X causes Y, it is desirable to perform a soft intervention on Y (rather than X), to see if we can create a collider \(I \rightarrow Y \leftarrow X\) (where I is the intervention). Soft interventions are closely related to instrumental variables. If there are no latent common causes, we can infer the true causal structure using soft interventions. Indeed, if we can intervene on every variable at once, we can determine the correct causal structure from this one intervention. However, if there are latent common causes, it is not in general possible to discover the complete causal structure using soft interventions. (Although this can be done if we assume linearity.)
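The collider signature that soft interventions exploit can be verified on a toy model of our own devising: X and I are independent fair coins and, purely for illustration, Y is the deterministic OR of the two. I and X are unconditionally independent, but become dependent once we condition on the collider Y.

```python
from itertools import product

# Toy collider I -> Y <- X: X and I are independent fair coins, Y = X OR I.
def p(x, i):
    return 0.25  # joint probability of each (x, i) pair

def y_of(x, i):
    return int(x or i)

# Marginally, I and X are independent ...
p_x1 = sum(p(1, i) for i in (0, 1))
p_i1 = sum(p(x, 1) for x in (0, 1))
assert p(1, 1) == p_x1 * p_i1

# ... but conditional on the collider Y = 1 they become dependent,
# which is the probabilistic signature a soft intervention creates.
worlds_y1 = [(x, i) for x, i in product((0, 1), repeat=2) if y_of(x, i) == 1]
p_y1 = sum(p(x, i) for x, i in worlds_y1)
p_x1_given_y1 = sum(p(x, i) for x, i in worlds_y1 if x == 1) / p_y1   # 2/3
p_i1_given_y1 = sum(p(x, i) for x, i in worlds_y1 if i == 1) / p_y1   # 2/3
p_both_given_y1 = p(1, 1) / p_y1                                      # 1/3
assert p_both_given_y1 != p_x1_given_y1 * p_i1_given_y1
```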

4.10 Counterfactuals

Section 3.3 above discussed counterfactuals in the context of deterministic causal models. The introduction of probability adds a number of complications. In particular, we can now talk meaningfully about the probability of a counterfactual being true. Counterfactuals play a central role in the potential outcome framework for causal models pioneered by Neyman (1923), and developed by Rubin (1974) and Robins (1986), among others.

Counterfactuals in the potential outcome framework interact with probability differently than counterfactuals in Lewis’s (1973b) framework. Suppose that Ted was exposed to asbestos and developed lung cancer. We are interested in the counterfactual: “If Ted had not been exposed to asbestos, he would not have developed lung cancer”. Suppose that the processes by which cancer develops are genuinely indeterministic. Then it seems wrong to say that if Ted had not been exposed to asbestos, he definitely would have developed lung cancer; and it seems equally wrong to say that he definitely would not have developed lung cancer. In this case, Lewis would say that the counterfactual “If Ted had not been exposed to asbestos, he would not have developed lung cancer” is determinately false. As a result, the objective probability of this counterfactual being true is zero. On the other hand, a counterfactual with an objective probability in the consequent may be true: “If Ted had not been exposed to asbestos, his objective chance of developing lung cancer would have been .06”. By contrast, in the potential outcome framework, probability may be pulled out of the consequent and applied to the counterfactual as a whole: The probability of the counterfactual “If Ted had not been exposed to asbestos, he would have developed lung cancer” can be .06.

If we have a complete structural equation model, we can assign probabilities to counterfactuals, in light of observations. Let \(\bV = \{X_1, X_2 ,\ldots ,X_n\}\) be a set of endogenous variables, and \(\bU = \{U_1, U_2 ,\ldots ,U_n\}\) a set of exogenous variables. Our structural equations have the form:

\[X_i = f_i (\bPA(X_i), U_i)\]

We have a probability distribution \(\Pr'\) on \(\bU\), which induces a probability distribution P on \(\bU \cup \bV\). Suppose that we observe the value of some of the variables: \(X_j = x_j\) for all \(j \in \bS \subseteq \{1,\ldots ,n\}\). We now want to assess the counterfactual “if \(X_k\) had been \(x_k\), then \(X_l\) would have been \(x_l\)”, where k and l may be in \(\bS\) but need not be. We can evaluate the probability of this counterfactual using this three-step process:

  1. Update the probability P by conditioning on the observations, to get a new probability distribution \(\Pr(• \mid \cap_{j \in \bS} X_j = x_j)\). Call the restriction of this probability function to \(\bU\) \(\Pr''\).
  2. Replace the equation for \(X_k\) with \(X_k = x_k\).
  3. Use the distribution \(\Pr''\) on \(\bU\) together with the modified set of equations to induce a new probability distribution P* on \(\bV\). \(\Pr^*(X_l = x_l)\) is then the probability of the counterfactual.

This procedure differs from the procedure for interventions (discussed in Section 4.7) in that steps 1 and 2 have been reversed. We first update the probability distribution, then perform the intervention. This reflects the fact that the observations tell us about the actual world, in which the intervention did not (necessarily) occur.
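The three-step procedure can be carried out by enumeration on a toy SEM of our own construction: two independent fair-coin exogenous variables, with equations X = U_X and Y = X OR U_Y. We observe Y = 1 and ask how probable it is that Y would still have been 1 had X been set to 0.

```python
from itertools import product

# Toy SEM: X = U_X, Y = max(X, U_Y), with U_X, U_Y independent fair coins.
p_u = {u: 0.25 for u in product((0, 1), repeat=2)}   # prior Pr' over (u_x, u_y)

def solve(u_x, u_y, x_override=None):
    """Solve the equations for a setting of U, optionally with X set by fiat."""
    x = u_x if x_override is None else x_override
    return x, max(x, u_y)

# Step 1: update Pr' by conditioning on the observation Y = 1.
consistent = {u: p for u, p in p_u.items() if solve(*u)[1] == 1}
total = sum(consistent.values())
p_u_posterior = {u: p / total for u, p in consistent.items()}

# Step 2: replace the equation for X with X = 0 (the intervention).
# Step 3: push the updated distribution through the modified equations.
p_counterfactual = sum(p for u, p in p_u_posterior.items()
                       if solve(*u, x_override=0)[1] == 1)

print(round(p_counterfactual, 3))  # 0.667, i.e. Pr(U_Y = 1 | Y = 1) = 2/3
```

Note that performing steps in the other order (intervene first, then condition) would give the interventional probability Pr(Y = 1 | do(X = 0)) = 1/2 instead, since the observation would then carry no information about U_Y.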

If we do not have a complete SEM, it is not generally possible to identify the probability of a counterfactual, but only to set upper and lower bounds. For example, suppose that we believe that asbestos exposure causes lung cancer, so that we posit a simple DAG:

\[A \rightarrow L.\]

Suppose also that we have data for people similar to Ted which yields the following probabilities:

\[\begin{aligned}\Pr(L = 1 \mid A = 1) & = .11,\\\Pr(L = 1 \mid A = 0) & = .06.\end{aligned}\]

(We are oversimplifying, and treating asbestos and lung cancer as binary variables.) We observe that Ted was in fact exposed to asbestos and did in fact develop lung cancer. What is the probability of the counterfactual: “If Ted had not been exposed to asbestos, he would not have developed lung cancer”? Pearl (2009) calls a probability of this form a probability of necessity. It is often called the probability of causation, although this terminology is misleading for reasons discussed by Greenland and Robins (1988). This quantity is often of interest in tort law. Suppose that Ted sues his employer for damages related to his lung cancer. He would have to persuade a jury that his exposure to asbestos caused his lung cancer. American civil law requires a “more probable than not” standard of proof, and it employs a “but for” or counterfactual definition of causation. Hence Ted must convince the jury that it is more probable than not that he would not have developed lung cancer if he had not been exposed.

We may divide the members of the population into four categories, depending upon which counterfactuals are true of them:

  • doomed individuals will develop lung cancer no matter what
  • immune individuals will avoid lung cancer no matter what
  • sensitive individuals will develop lung cancer just in case they are exposed to asbestos
  • reverse sensitive individuals will develop lung cancer just in case they are not exposed to asbestos

It is easiest to think of the population as being divided into four categories, with each person being one of these four types. However, we do not need to assume that the process is deterministic; it may be the case that each person only has a certain probability of falling into one of these categories.

Mathematically, this is equivalent to the following. Let \(U_L\) be the error variable for \(L\). \(U_L\) takes values of the form \((u_1, u_2)\) with each \(u_i\) being 0 or 1. \((1, 1)\) corresponds to doomed, \((0, 0)\) to immune, \((1, 0)\) to sensitive, and \((0, 1)\) to reverse. That is, the first element tells us what value L will take if an individual is exposed to asbestos, and the second element what value L will take if an individual is not exposed. The equation for L will be \(L = (A \times u_1) + ((1 - A) \times u_2)\).

Let us assume that the distribution of the error variable \(U_L\) is independent of asbestos exposure A. The observed probability of lung cancer is compatible with both of the following probability distributions over our four counterfactual categories:

\[\begin{aligned}\Pr_1(\textit{doomed}) & = .06, &\Pr_2(\textit{doomed}) &= 0,\\\Pr_1(\textit{immune}) & = .89, & \Pr_2(\textit{immune}) & = .83,\\\Pr_1(\textit{sensitive}) & = .05, & \Pr_2(\textit{sensitive}) & = .11, \\\Pr_1(\textit{reverse}) & = 0 & \Pr_2(\textit{reverse}) & = .06\end{aligned}\]

More generally, the observed probability is compatible with any probability \(\Pr'\) satisfying:

\[\begin{aligned}\Pr'(\textit{doomed}) + \Pr'(\textit{sensitive}) & = \Pr(L \mid A) & = .11;\\\Pr'(\textit{immune}) + \Pr'(\textit{reverse}) & = \Pr({\sim}L \mid A) & = .89;\\\Pr'(\textit{doomed}) + \Pr'(\textit{reverse}) & = \Pr(L \mid {\sim}A) & = .06;\\\Pr'(\textit{immune}) + \Pr'(\textit{sensitive}) & = \Pr({\sim}L \mid {\sim}A) & = .94.\\ \end{aligned} \]

\(\Pr_1\) and \(\Pr_2\) are just the most extreme cases. From the fact that Ted was exposed to asbestos and developed lung cancer, we know that he is either sensitive or doomed. The counterfactual of interest will be true just in case he is sensitive. Hence the probability of the counterfactual, given the available evidence, is P(sensitive | sensitive or doomed). However, using \(\Pr_1\) yields a conditional probability of .45 (5/11), while \(\Pr_2\) yields a conditional probability of 1. Given the information available to us, all we can conclude is that the probability of necessity is between .45 and 1. To determine the probability more precisely, we would need to know the probability distribution of the error variable.
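These bounds can be checked directly from the two extremal distributions given above (the function names are ours; the numbers are the text’s):

```python
# The two extremal distributions over the counterfactual types.
pr1 = {"doomed": 0.06, "immune": 0.89, "sensitive": 0.05, "reverse": 0.00}
pr2 = {"doomed": 0.00, "immune": 0.83, "sensitive": 0.11, "reverse": 0.06}

def prob_necessity(pr):
    """Pr(sensitive | sensitive or doomed): Ted was exposed and got cancer."""
    return pr["sensitive"] / (pr["sensitive"] + pr["doomed"])

def prob_sufficiency(pr):
    """Pr(sensitive | immune or sensitive): Teresa was unexposed and healthy."""
    return pr["sensitive"] / (pr["sensitive"] + pr["immune"])

print(round(prob_necessity(pr1), 3), prob_necessity(pr2))
# 0.455 1.0 -- the bounds on the probability of necessity

print(round(prob_sufficiency(pr1), 3), round(prob_sufficiency(pr2), 3))
# 0.053 0.117 -- the bounds on the probability of sufficiency
```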

A closely related counterfactual quantity is what Pearl (2009) calls the probability of sufficiency. Suppose that Teresa, unlike Ted, was not exposed to asbestos, and did not develop lung cancer. The probability of sufficiency is the probability that she would have suffered lung cancer if she had been exposed. That is, the probability of sufficiency is the probability that if the cause were added to a situation in which both it and the effect were absent, it would have resulted in the effect occurring. The probability of sufficiency is closely related to the quantity that Sheps (1958) called the relative difference, and that Cheng (1997) calls the causal power. Cheng’s terminology reflects the idea that the probability of sufficiency of C for E is the power of C to bring about E in cases where E is absent. As in the case of the probability of necessity, if one does not have a complete structural equation model, but only a Causal Bayes Net or Semi-Markov Causal Model, it is usually only possible to put upper and lower bounds on the probability of sufficiency. Using the probabilities from the previous example, the probability of sufficiency of asbestos for lung cancer would be between .05 (5/94) and .12 (11/94).

Determining the probabilities of counterfactuals, even just upper and lower bounds, is computationally demanding. Balke and Pearl’s twin network method (Balke & Pearl 1994a, 1994b; Pearl 2009: 213–215) and Richardson and Robins’ split-node method (Richardson & Robins 2016) are two methods that have been proposed for solving this kind of problem.

5. Further Reading

The most important works surveyed in this entry are Pearl 2009 and Spirtes, Glymour, & Scheines 2000. Pearl 2010, Pearl et al. 2016, and Pearl & Mackenzie 2018 are three overviews of Pearl’s program. Pearl 2010 is the shortest, but the most technical. Pearl & Mackenzie 2018 is the least technical. Scheines 1997 and the “Introduction” of Glymour & Cooper 1999 are accessible introductions to the SGS program. Eberhardt 2009, Hausman 1999, Glymour 2009, and Hitchcock 2009 are short overviews that cover some of the topics raised in this entry.

The entry on causation and manipulability contains extensive discussion of interventions, and some discussion of causal models.

Halpern (2016) engages with many of the topics in Section 3. See also the entry for counterfactual theories of causation.

The entry on probabilistic causation contains some overlap with the present entry. Some of the material from Section 4 of this entry is also presented in Section 3 of that entry. That entry contains in addition some discussion of the connection between probabilistic causal models and earlier probabilistic theories of causation.

Eberhardt 2017 is a short survey that provides a clear introduction to many of the topics covered in Sections 4.2 through 4.6, as well as Section 4.9. Spirtes and Zhang 2016 is a longer and more technical overview that covers much of the same ground. It has particularly good coverage of the issues raised in Section 4.5.

The entries on decision theory and causal decision theory present more detailed background information about some of the issues raised in Section 4.8.

This entry has focused on topics that are likely to be of most interest to philosophers. There are a number of important technical issues that have been largely ignored. Many of these address problems that arise when various simplifying assumptions made here (such as acyclicity, and knowledge of the true probabilities) are rejected. Some of these issues are briefly surveyed along with references in the Supplement on Further Topics in Causal Inference.

Bibliography

  • Balke, Alexander and Judea Pearl, 1994a, “Probabilistic Evaluation of Counterfactual Queries”, in Barbara Hayes-Roth and Richard E. Korf (eds.), Proceedings of the Twelfth National Conference on Artificial Intelligence, Volume I, Menlo Park, CA: AAAI Press, pp. 230–237. [Balke & Pearl 1994a available online]
  • –––, 1994b, “Counterfactual Probabilities: Computational Methods, Bounds, and Applications”, in Ramon Lopez de Mantaras and David Poole (eds.), Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, pp. 46–54. [Balke & Pearl 1994b available online]
  • Bareinboim, Elias and Judea Pearl, 2013, “A General Algorithm for Deciding Transportability of Experimental Results”, Journal of Causal Inference, 1(1): 107–134. doi:10.1515/jci-2012-0004
  • –––, 2014, “Transportability from Multiple Environments with Limited Experiments: Completeness Results”, in Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil Lawrence, and Kilian Weinberger (eds.), Advances in Neural Information Processing Systems 27 (NIPS Proceedings), 280–288. [Bareinboim & Pearl 2014 available online]
  • –––, 2015, “Causal Inference and the Data-Fusion Problem”, Proceedings of the National Academy of Sciences, 113(27): 7345–7352. doi:10.1073/pnas.1510507113
  • Beckers, Sander and Joost Vennekens, 2018, “A Principled Approach to Defining Actual Causation”, Synthese, 195(2): 835–862. doi:10.1007/s11229-016-1247-1
  • Beebee, Helen, Christopher Hitchcock, and Peter Menzies (eds.), 2009, The Oxford Handbook of Causation, Oxford: Oxford University Press.
  • Blanchard, Thomas and Jonathan Schaffer, 2017, “Cause without Default”, in Helen Beebee, Christopher Hitchcock, and Huw Price (eds.), Making a Difference, Oxford: Oxford University Press, pp. 175–214.
  • Briggs, Rachael, 2012, “Interventionist Counterfactuals”, Philosophical Studies, 160(1): 139–166. doi:10.1007/s11098-012-9908-5
  • Cartwright, Nancy, 1993, “Marks and Probabilities: Two Ways to Find Causal Structure”, in Fritz Stadler (ed.), Scientific Philosophy: Origins and Development, Dordrecht: Kluwer, pp. 113–119. doi:10.1007/978-94-017-2964-2_7
  • –––, 2007, Hunting Causes and Using Them, Cambridge: Cambridge University Press. doi:10.1017/CBO9780511618758
  • Chalupka, Krzysztof, Frederick Eberhardt, and Pietro Perona, 2017, “Causal Feature Learning: an Overview”, Behaviormetrika, 44(1): 137–167. doi:10.1007/s41237-016-0008-2
  • Cheng, Patricia, 1997, “From Covariation to Causation: A Causal Power Theory”, Psychological Review, 104(2): 367–405. doi:10.1037/0033-295X.104.2.367
  • Claassen, Tom and Tom Heskes, 2012, “A Bayesian Approach to Constraint Based Causal Inference”, in Nando de Freitas and Kevin Murphy (eds.), Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, Corvallis, OR: AUAI Press, pp. 207–216. [Claassen & Heskes 2012 available online]
  • Cooper, Gregory F. and Edward Herskovits, 1992, “A Bayesian Method for the Induction of Probabilistic Networks from Data”, Machine Learning, 9(4): 309–347. doi:10.1007/BF00994110
  • Danks, David and Sergey Plis, 2014, “Learning Causal Structure from Undersampled Time Series”, JMLR Workshop and Conference Proceedings (NIPS Workshop on Causality). [Danks & Plis 2014 available online]
  • Dash, Denver and Marek Druzdzel, 2001, “Caveats for Causal Reasoning with Equilibrium Models”, in Salem Benferhat and Philippe Besnard (eds.), Symbolic and Quantitative Approaches to Reasoning with Uncertainty, 6th European Conference, Proceedings (Lecture Notes in Computer Science 2143), Berlin and Heidelberg: Springer, pp. 92–103. doi:10.1007/3-540-44652-4_18
  • Dechter, Rina and Thomas Richardson (eds.), 2006, Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, Corvallis, OR: AUAI Press.
  • Dowe, Phil, 2000, Physical Causation, Cambridge: Cambridge University Press. doi:10.1017/CBO9780511570650
  • Eberhardt, Frederick, 2009, “Introduction to the Epistemology of Causation”, Philosophy Compass, 4(6): 913–925. doi:10.1111/j.1747-9991.2009.00243.x
  • –––, 2017, “Introduction to the Foundations of Causal Discovery”, International Journal of Data Science and Analytics, 3(2): 81–91. doi:10.1007/s41060-016-0038-6
  • Eberhardt, Frederick and Richard Scheines, 2007, “Interventions and Causal Inference”, Philosophy of Science, 74(5): 981–995. doi:10.1086/525638
  • Eells, Ellery, 1991, Probabilistic Causality, Cambridge: Cambridge University Press. doi:10.1017/CBO9780511570667
  • Eichler, Michael, 2012, “Causal Inference in Time Series Analysis”, in Carlo Berzuini, Philip Dawid, and Luisa Bernardinelli (eds.), Causality: Statistical Perspectives and Applications, Chichester, UK: Wiley, pp. 327–354. doi:10.1002/9781119945710.ch22
  • Fine, Kit, 2012, “Counterfactuals without Possible Worlds”, Journal of Philosophy, 109(3): 221–246. doi:10.5840/jphil201210938
  • Galles, David and Judea Pearl, 1998, “An Axiomatic Characterization of Causal Counterfactuals”, Foundations of Science, 3(1): 151–182. doi:10.1023/A:1009602825894
  • Geiger, Dan and David Heckerman, 1994, “Learning Gaussian Networks”, Technical Report MSR-TR-94-10, Microsoft Research.
  • Geiger, Dan and Judea Pearl, 1988, “On the Logic of Causal Models”, in Ross Shachter, Tod Levitt, Laveen Kanal, and John Lemmer (eds.), Proceedings of the Fourth Conference on Uncertainty in Artificial Intelligence, Corvallis, OR: AUAI Press, pp. 136–147.
  • Gibbard, Alan and William Harper, 1978, “Counterfactuals and Two Kinds of Expected Utility”, in Clifford Hooker, James Leach, and Edward McClennen (eds.), Foundations and Applications of Decision Theory, Dordrecht: Reidel, pp. 125–162.
  • Glennan, Stuart, 2017, The New Mechanical Philosophy, Oxford: Oxford University Press.
  • Glymour, Clark, 2009, “Causality and Statistics”, in Beebee, Hitchcock, and Menzies 2009: 498–522.
  • Glymour, Clark and Gregory Cooper, 1999, Computation, Causation, and Discovery, Cambridge, MA: MIT Press.
  • Glymour, Clark, David Danks, Bruce Glymour, Frederick Eberhardt, Joseph Ramsey, Richard Scheines, Peter Spirtes, Choh Man Teng, and Jiji Zhang, 2010, “Actual Causation: a Stone Soup Essay”, Synthese, 175(2): 169–192. doi:10.1007/s11229-009-9497-9
  • Glymour, Clark and Frank Wimberly, 2007, “Actual Causes and Thought Experiments”, in Joseph Campbell, Michael O’Rourke, and Harry Silverstein (eds.), Causation and Explanation, Cambridge, MA: MIT Press, pp. 43–68.
  • Gong, Mingming, Kun Zhang, Bernhard Schölkopf, Dacheng Tao, and Philipp Geiger, 2015, “Discovering Temporal Causal Relations from Subsampled Data”, in Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, 37: 1898–1906. [Gong et al. 2015 available online]
  • Gong, Mingming, Kun Zhang, Bernhard Schölkopf, Clark Glymour, and Dacheng Tao, 2017, “Causal Discovery from Temporally Aggregated Time Series”, in Gal Elidan and Kristian Kersting (eds.), Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, Corvallis, OR: AUAI Press. [Gong et al. 2017 available online]
  • Greenland, Sander and James Robins, 1988, “Conceptual Problems in the Definition and Interpretation of Attributable Fractions”, American Journal of Epidemiology, 128(6): 1185–1197. doi:10.1093/oxfordjournals.aje.a115073
  • Hall, Ned, 2007, “Structural Equations and Causation”, Philosophical Studies, 132(1): 109–136. doi:10.1007/s11098-006-9057-9
  • Halpern, Joseph Y., 2000, “Axiomatizing Causal Reasoning”, Journal of Artificial Intelligence Research, 12: 317–337. [Halpern 2000 available online]
  • –––, 2008, “Defaults and Normality in Causal Structures”, in Gerhard Brewka and Jérôme Lang (eds.), Principles of Knowledge Representation and Reasoning: Proceedings of the Eleventh International Conference, Menlo Park, CA: AAAI Press, pp. 198–208.
  • –––, 2016, Actual Causality, Cambridge, MA: MIT Press.
  • Halpern, Joseph Y. and Christopher Hitchcock, 2015, “Graded Causation and Defaults”, British Journal for the Philosophy of Science, 66(2): 413–457. doi:10.1093/bjps/axt050
  • Halpern, Joseph and Judea Pearl, 2001, “Causes and Explanations: A Structural-Model Approach. Part I: Causes”, in John Breese and Daphne Koller (eds.), Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, pp. 194–202.
  • –––, 2005, “Causes and Explanations: A Structural-Model Approach. Part I: Causes”, British Journal for the Philosophy of Science, 56(4): 843–887. doi:10.1093/bjps/axi147
  • Hausman, Daniel M., 1999, “The Mathematical Theory of Causation”, British Journal for the Philosophy of Science, 50(1): 151–162. doi:10.1093/bjps/50.1.151
  • Hausman, Daniel M. and James Woodward, 1999, “Independence, Invariance, and the Causal Markov Condition”, British Journal for the Philosophy of Science, 50(4): 521–583. doi:10.1093/bjps/50.4.521
  • –––, 2004, “Modularity and the Causal Markov Condition: a Restatement”, British Journal for the Philosophy of Science, 55(1): 147–161. doi:10.1093/bjps/55.1.147
  • Hitchcock, Christopher, 2001, “The Intransitivity of Causation Revealed in Equations and Graphs”, Journal of Philosophy, 98(6): 273–299. doi:10.2307/2678432
  • –––, 2007, “Prevention, Preemption, and the Principle of Sufficient Reason”, Philosophical Review, 116(4): 495–532. doi:10.1215/00318108-2007-012
  • –––, 2009, “Causal Models”, in Beebee, Hitchcock, and Menzies 2009: 299–314.
  • –––, 2016, “Conditioning, Intervening, and Decision”, Synthese, 193(4): 1157–1176. doi:10.1007/s11229-015-0710-8
  • Hoyer, Patrik O., Dominik Janzing, Joris Mooij, Jonas Peters, and Bernhard Schölkopf, 2009, “Nonlinear Causal Discovery with Additive Noise Models”, Advances in Neural Information Processing Systems, 21: 689–696. [Hoyer et al. 2009 available online]
  • Huang, Yimin and Marco Valtorta, 2006, “Pearl’s Calculus of Intervention Is Complete”, in Dechter and Richardson 2006: 217–224. [Huang & Valtorta 2006 available online]
  • Hyttinen, Antti, Frederick Eberhardt, and Patrik O. Hoyer, 2013a, “Experiment Selection for Causal Discovery”, Journal of Machine Learning Research, 14: 3041–3071. [Hyttinen, Eberhardt, & Hoyer 2013a available online]
  • Hyttinen, Antti, Frederick Eberhardt, and Matti Järvisalo, 2014, “Constraint-based Causal Discovery: Conflict Resolution with Answer Set Programming”, in Nevin Zhang and Jin Tian (eds.), Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, Corvallis, OR: AUAI Press, pp. 340–349.
  • –––, 2015, “Do-calculus When the True Graph is Unknown”, in Marina Meila and Tom Heskes (eds.), Uncertainty in Artificial Intelligence: Proceedings of the Thirty-First Conference, Corvallis, OR: AUAI Press, pp. 395–404.
  • Hyttinen, Antti, Patrik O. Hoyer, Frederick Eberhardt, and Matti Järvisalo, 2013b, “Discovering Cyclic Causal Models with Latent Variables: A General SAT-Based Procedure”, in Nichols and Smyth 2013: 301–310.
  • Hyttinen, Antti, Sergey Plis, Matti Järvisalo, Frederick Eberhardt, and David Danks, 2016, “Causal Discovery from Subsampled Time Series Data by Constraint Optimization”, in Alessandro Antonucci, Giorgio Corani, and Cassio Polpo Campos (eds.), Proceedings of the Eighth International Conference on Probabilistic Graphical Models, pp. 216–227.
  • Jeffrey, Richard, 1983, The Logic of Decision, Second Edition, Chicago: University of Chicago Press.
  • Joyce, James M., 1999, The Foundations of Causal Decision Theory, Cambridge: Cambridge University Press. doi:10.1017/CBO9780511498497
  • Lewis, David, 1973a, “Causation”, Journal of Philosophy, 70(17): 556–567. doi:10.2307/2025310
  • –––, 1973b, Counterfactuals, Oxford: Blackwell.
  • –––, 1979, “Counterfactual Dependence and Time’s Arrow”, Noûs, 13(4): 455–476. doi:10.2307/2215339
  • –––, 1981, “Causal Decision Theory”, Australasian Journal of Philosophy, 59(1): 5–30. doi:10.1080/00048408112340011
  • Machamer, Peter, Lindley Darden, and Carl Craver, 2000, “Thinking about Mechanisms”, Philosophy of Science, 67(1): 1–25. doi:10.1086/392759
  • Maier, Marc, Katerina Marazopoulou, David Arbour, and David Jensen, 2013, “A Sound and Complete Algorithm for Learning Causal Models from Relational Data”, in Nichols and Smyth 2013: 371–380. [Maier et al. 2013 available online]
  • Maier, Marc, Brian Taylor, Hüseyin Oktay, and David Jensen, 2010, “Learning Causal Models of Relational Domains”, in Maria Fox and David Poole (eds.), Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press, pp. 531–538. [Maier et al. 2010 available online]
  • Meek, Christopher, 1995, “Strong Completeness and Faithfulness in Bayesian Networks”, in Philippe Besnard and Steve Hanks (eds.), Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, pp. 411–418.
  • Meek, Christopher and Clark Glymour, 1994, “Conditioning and Intervening”, British Journal for the Philosophy of Science, 45(4): 1001–1024. doi:10.1093/bjps/45.4.1001
  • Menzies, Peter, 2004, “Causal Models, Token Causation, and Processes”, Philosophy of Science, 71(5): 820–832. doi:10.1086/425057
  • Mooij, Joris, Dominik Janzing, and Bernhard Schölkopf, 2013, “From Ordinary Differential Equations to Structural Causal Models: the Deterministic Case”, in Nichols and Smyth 2013: 440–448.
  • Neal, Radford M., 2000, “On Deducing Conditional Independence from d-separation in Causal Graphs with Feedback”, Journal of Artificial Intelligence Research, 12: 87–91. [Neal 2000 available online]
  • Neapolitan, Richard, 2004, Learning Bayesian Networks, Upper Saddle River, NJ: Prentice Hall.
  • Neapolitan, Richard and Xia Jiang, 2016, “The Bayesian Network Story”, in Alan Hájek and Christopher Hitchcock (eds.), The Oxford Handbook of Probability and Philosophy, Oxford: Oxford University Press, pp. 183–199.
  • Neyman, Jerzy, 1923 [1990], “Sur les Applications de la Théorie des Probabilités aux Experiences Agricoles: Essai des Principes”, Roczniki Nauk Rolniczych, Tom X: 1–51. Excerpts translated into English by D. M. Dabrowska and Terrence Speed, 1990, “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles”, Statistical Science, 5(4): 465–480. doi:10.1214/ss/1177012031
  • Nichols, Ann and Padhraic Smyth (eds.), 2013, Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, Corvallis, OR: AUAI Press.
  • Nozick, Robert, 1969, “Newcomb’s Problem and Two Principles of Choice”, in Nicholas Rescher (ed.), Essays in Honor of Carl G. Hempel, Dordrecht: Reidel, pp. 114–146. doi:10.1007/978-94-017-1466-2_7
  • Pearl, Judea, 1988, Probabilistic Reasoning in Intelligent Systems, San Francisco: Morgan Kaufmann.
  • –––, 1995, “Causal Diagrams for Empirical Research”, Biometrika, 82(4): 669–688. doi:10.1093/biomet/82.4.669
  • –––, 2009, Causality: Models, Reasoning, and Inference, Second Edition, Cambridge: Cambridge University Press.
  • –––, 2010, “An Introduction to Causal Inference”, The International Journal of Biostatistics, 6(2): article 7, pp. 1–59. doi:10.2202/1557-4679.1203
  • Pearl, Judea and Rina Dechter, 1996, “Identifying Independencies in Causal Graphs with Feedback”, in Eric Horvitz and Finn Jensen (eds.), Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, pp. 420–426.
  • Pearl, Judea, Madelyn Glymour, and Nicholas P. Jewell, 2016, Causal Inference in Statistics: A Primer, Chichester, UK: Wiley.
  • Pearl, Judea and Dana Mackenzie, 2018, The Book of Why: The New Science of Cause and Effect, New York: Basic Books.
  • Pearl, Judea and Thomas Verma, 1991, “A Theory of Inferred Causation”, in James Allen, Richard Fikes, and Erik Sandewall (eds.), Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference, San Mateo, CA: Morgan Kaufmann, pp. 441–452.
  • Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf, 2017, Elements of Causal Inference: Foundations and Learning Algorithms, Cambridge, MA: MIT Press.
  • Price, Huw, 1986, “Against Causal Decision Theory”, Synthese, 67(2): 195–212. doi:10.1007/BF00540068
  • Ramsey, Joseph, Peter Spirtes, and Jiji Zhang, 2006, “Adjacency Faithfulness and Conservative Causal Inference”, in Dechter and Richardson 2006: 401–408. [Ramsey, Spirtes, & Zhang 2006 available online]
  • Reichenbach, Hans, 1956, The Direction of Time, Berkeley and Los Angeles: University of California Press.
  • Richardson, Thomas and James Robins, 2016, Single World Intervention Graphs (SWIGs): A Unification of the Counterfactual and Graphical Approaches to Causality, Hanover, MA: Now Publishers.
  • Robins, James, 1986, “A New Approach to Causal Inference in Mortality Studies with a Sustained Exposure Period: Applications to Control of the Healthy Workers Survivor Effect”, Mathematical Modeling, 7(9–12): 1393–1512. doi:10.1016/0270-0255(86)90088-6
  • Rubin, Donald, 1974, “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies”, Journal of Educational Psychology, 66(5): 688–701. doi:10.1037/h0037350
  • Salmon, Wesley, 1984, Scientific Explanation and the Causal Structure of the World, Princeton: Princeton University Press.
  • Scheines, Richard, 1997, “An Introduction to Causal Inference”, in V. McKim and S. Turner (eds.), Causality in Crisis?, Notre Dame: University of Notre Dame Press, pp. 185–199.
  • Schulte, Oliver and Hassan Khosravi, 2012, “Learning Graphical Models for Relational Data via Lattice Search”, Machine Learning, 88(3): 331–368. doi:10.1007/s10994-012-5289-4
  • Schulte, Oliver, Wei Luo, and Russell Greiner, 2010, “Mind Change Optimal Learning of Bayes Net Structure from Dependency and Independency Data”, Information and Computation, 208(1): 63–82. doi:10.1016/j.ic.2009.03.009
  • Shalizi, Cosma Rohilla and Andrew C. Thomas, 2011, “Homophily and Contagion are Generically Confounded in Observational Social Studies”, Sociological Methods and Research, 40(2): 211–239. doi:10.1177/0049124111404820
  • Sheps, Mindel C., 1958, “Shall We Count the Living or the Dead?”, New England Journal of Medicine, 259(12): 210–4. doi:10.1056/NEJM195812182592505
  • Shimizu, Shohei, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen, 2006, “A Linear Non-Gaussian Acyclic Model for Causal Discovery”, Journal of Machine Learning Research, 7: 2003–2030. [Shimizu et al. 2006 available online]
  • Shpitser, Ilya and Judea Pearl, 2006, “Identification of Conditional Interventional Distributions”, in Dechter and Richardson 2006: 437–444. [Shpitser & Pearl 2006 available online]
  • Skyrms, Brian, 1980, Causal Necessity, New Haven and London: Yale University Press.
  • Spirtes, Peter, 1995, “Directed Cyclic Graphical Representation of Feedback Models”, in Philippe Besnard and Steve Hanks (eds.), Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, pp. 491–498.
  • Spirtes, Peter, Clark Glymour, and Richard Scheines [SGS], 2000, Causation, Prediction and Search, Second Edition, Cambridge, MA: MIT Press.
  • Spirtes, Peter and Jiji Zhang, 2014, “A Uniformly Consistent Estimator of Causal Effects under the k-Triangle-Faithfulness Assumption”, Statistical Science, 29(4): 662–678. doi:10.1214/13-STS429
  • Spirtes, Peter and Kun Zhang, 2016, “Causal Discovery and Inference: Concepts and Recent Methodological Advances”, Applied Informatics, 3: 3. doi:10.1186/s40535-016-0018-x
  • Stalnaker, Robert, 1968, “A Theory of Conditionals”, in Nicholas Rescher (ed.), Studies in Logical Theory, Oxford: Blackwell, pp. 98–112.
  • Steel, Daniel, 2006, “Homogeneity, Selection, and the Faithfulness Condition”, Minds and Machines, 16(3): 303–317. doi:10.1007/s11023-006-9032-4
  • Stern, Reuben, 2017, “Interventionist Decision Theory”, Synthese, 194(10): 4133–4153. doi:10.1007/s11229-016-1133-x
  • Suppes, Patrick, 1970, A Probabilistic Theory of Causality, Amsterdam: North-Holland Publishing Company.
  • Tillman, Robert E. and Frederick Eberhardt, 2014, “Learning Causal Structure from Multiple Datasets with Similar Variable Sets”, Behaviormetrika, 41(1): 41–64. doi:10.2333/bhmk.41.41
  • Triantafillou, Sofia and Ioannis Tsamardinos, 2015, “Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets”, Journal of Machine Learning Research, 16: 2147–2205. [Triantafillou & Tsamardinos 2015 available online]
  • Weslake, Brad, forthcoming, “A Partial Theory of Actual Causation”, British Journal for the Philosophy of Science.
  • Woodward, James, 2003, Making Things Happen: A Theory of Causal Explanation, Oxford: Oxford University Press. doi:10.1093/0195155270.001.0001
  • Wright, Sewall, 1921, “Correlation and Causation”, Journal of Agricultural Research, 20: 557–585.
  • Zhalama, Jiji Zhang, and Wolfgang Mayer, 2016, “Weakening Faithfulness: Some Heuristic Causal Discovery Algorithms”, International Journal of Data Science and Analytics, 3(2): 93–104. doi:10.1007/s41060-016-0033-y
  • Zhang, Jiji, 2008, “Causal Reasoning with Ancestral Graphs”, Journal of Machine Learning Research, 9: 1437–1474. [Zhang 2008 available online]
  • –––, 2013a, “A Lewisian Logic of Counterfactuals”, Minds and Machines, 23(1): 77–93. doi:10.1007/s11023-011-9261-z
  • –––, 2013b, “A Comparison of Three Occam’s Razors for Markovian Causal Models”, British Journal for the Philosophy of Science, 64(2): 423–448. doi:10.1093/bjps/axs005
  • Zhang, Jiji and Peter Spirtes, 2008, “Detection of Unfaithfulness and Robust Causal Inference”, Minds and Machines, 18(2): 239–271. doi:10.1007/s11023-008-9096-4
  • –––, 2016, “The Three Faces of Faithfulness”, Synthese, 193(4): 1011–1027. doi:10.1007/s11229-015-0673-9
  • Zhang, Kun and Aapo Hyvärinen, 2009, “On the Identifiability of the Post-nonlinear Causal Model”, in Jeff Bilmes and Andrew Ng (eds.), Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Corvallis, OR: AUAI Press, pp. 647–655.

Other Internet Resources

Acknowledgments

Thanks to Frederick Eberhardt, Clark Glymour, Joseph Halpern, Judea Pearl, Peter Spirtes, Reuben Stern, Jiji Zhang, and Kun Zhang for detailed comments, corrections, and discussion.

Portions of this entry are taken, with minimal adaptation, from the author’s separate entry on probabilistic causation, so that readers do not need to consult that entry for background material before reading this entry.

Copyright © 2018 by
Christopher Hitchcock <cricky@caltech.edu>



The Stanford Encyclopedia of Philosophy is copyright © 2024 by The Metaphysics Research Lab, Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054

