“Probabilistic Causation” designates a group of theories that aim to characterize the relationship between cause and effect using the tools of probability theory. The central idea behind these theories is that causes change the probabilities of their effects. This article traces developments in probabilistic causation, including recent developments in causal modeling.
This entry surveys the main approaches to characterizing causation in terms of probability. Section 1 provides some of the motivation for probabilistic approaches to causation, and addresses a few preliminary issues. Section 2 surveys theories that aim to characterize causation in terms of probability-raising. Section 3 surveys developments in causal modeling. Section 4 covers probabilistic accounts of actual causation.
In this section, we will provide some motivation for trying to understand causation in terms of probabilities, and address a couple of preliminary issues.
According to David Hume, causes are invariably followed by their effects:
We may define a cause to be an object, followed by another, and where all the objects similar to the first, are followed by objects similar to the second. (1748: section VII)
Attempts to analyze causation in terms of invariable patterns of succession are referred to as “regularity theories” of causation. There are a number of well-known problems facing regularity theories, at least in their simplest forms, and these may be used to motivate probabilistic approaches to causation. Moreover, an overview of these difficulties will help to give a sense of the kinds of problem that any adequate theory of causation would have to solve.
(i) Imperfect Regularities. The first difficulty is that most causes are not invariably followed by their effects. For example, smoking is a cause of lung cancer, even though some smokers do not develop lung cancer. Imperfect regularities may arise for two different reasons. First, they may arise because of the heterogeneity of circumstances in which the cause arises. For example, some smokers may have a genetic susceptibility to lung cancer, while others do not; some non-smokers may be exposed to other carcinogens (such as asbestos), while others are not. Second, imperfect regularities may also arise because of a failure of physical determinism. If an event is not determined to occur, then no other event can be (or be a part of) a sufficient condition for that event. The success of quantum mechanics—and to a lesser extent, other theories employing probability—has shaken our faith in determinism. Thus it has struck many philosophers as desirable to develop a theory of causation that does not presuppose determinism.
The central idea behind probabilistic theories of causation is that causes change the probability of their effects; an effect may still occur in the absence of a cause or fail to occur in its presence. Thus smoking is a cause of lung cancer, not because all smokers develop lung cancer, but because smokers are more likely to develop lung cancer than non-smokers. This is entirely consistent with there being some smokers who avoid lung cancer, and some non-smokers who succumb to it.
(ii) Irrelevance. A condition that is invariably followed by some outcome may nonetheless be irrelevant to that outcome. Salt that has been hexed by a sorcerer invariably dissolves when placed in water (Kyburg 1965), but hexing does not cause the salt to dissolve. Hexing does not make a difference for dissolution. Probabilistic theories of causation capture this notion of making a difference by requiring that a cause make a difference for the probability of its effect.
(iii) Asymmetry. If A causes B, then, typically, B will not also cause A. Smoking causes lung cancer, but lung cancer does not cause one to smoke. One way of enforcing the asymmetry of causation is to stipulate that causes precede their effects in time. But it would be nice if a theory of causation could provide some explanation of the directionality of causation, rather than merely stipulate it. Some proponents of probabilistic theories of causation have attempted to use the resources of probability theory to articulate a substantive account of the asymmetry of causation.
(iv) Spurious Regularities. Suppose that a cause is regularly followed by two effects. Here is an example from Jeffrey (1969): Suppose that whenever the barometric pressure in a certain region drops below a certain level, two things happen. First, the height of the column of mercury in a particular barometer drops below a certain level. Shortly afterwards, a storm occurs. This situation is shown schematically in Figure 1. Then, it may well also be the case that whenever the column of mercury drops, there will be a storm. If so, a simple regularity theory would seem to rule that the drop of the mercury column causes the storm. In fact, however, the regularity relating these two events is spurious. The ability to handle such spurious correlations is probably the greatest source of attraction for probabilistic theories of causation.
Figure 1
In this sub-section, we will review some of the basics of the mathematical theory of probability, and introduce some notation. Readers already familiar with the mathematics of probability may wish to skip this section.
Probability is a function, P, that assigns values between zero and one, inclusive. Usually the arguments of the function are taken to be sets, or propositions in a formal language. The formal term for these arguments is ‘events’. We will here use the notation that is appropriate for propositions, with ‘\(\nsim\)’ representing negation, ‘&’ representing conjunction, and ‘\(\vee\)’ representing disjunction. Sometimes when there is a long conjunction, this is abbreviated by using commas instead of ampersands. The domain of a probability function has the structure of a field or a Boolean algebra. This means that the domain is closed under complementation and the taking of finite unions or intersections (for sets), or under negation, conjunction, and disjunction (for propositions). Thus if A and B are events in the domain of P, so are \({\nsim}A\), \(A \amp B\), and \(A \vee B\).
Some standard properties of probability are the following:
In addition to probability theory, the entry will use basic notation from set theory and logic. Sets will appear in boldface.
Some further definitions:
If \(\PP(B) = 0\), then the ratio in the definition of conditional probability is undefined. There are, however, a variety of technical developments that will allow us to define \(\PP(A \mid B)\) when \(\PP(B)\) is 0. We will ignore this problem here.
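To make the ratio definition concrete, here is a minimal sketch (the toy distribution and the helper names `prob` and `cond_prob` are illustrative assumptions, not from the entry) that computes conditional probabilities from a joint distribution over two binary events:

```python
# A toy joint distribution over two binary events A and B,
# given as the probabilities of the four conjunctions.
joint = {
    ("A", "B"): 0.2,
    ("A", "~B"): 0.1,
    ("~A", "B"): 0.3,
    ("~A", "~B"): 0.4,
}

def prob(event):
    """Marginal probability of a single event ('A', '~A', 'B', or '~B')."""
    return sum(p for outcome, p in joint.items() if event in outcome)

def cond_prob(a, b):
    """P(a | b) = P(a & b) / P(b); undefined (None) when P(b) = 0."""
    pb = prob(b)
    if pb == 0:
        return None
    pab = sum(p for outcome, p in joint.items()
              if a in outcome and b in outcome)
    return pab / pb

print(cond_prob("A", "B"))  # 0.2 / 0.5 = 0.4
```

The `None` branch mirrors the point in the text: when \(\PP(B) = 0\) the ratio is simply undefined.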
As a convenient shorthand, a probabilistic statement that contains only a variable or set of variables, but no values, will be understood as a universal quantification over all possible values of the variable(s). Thus if \(\bX = \{X_1 , \ldots ,X_m\}\) and \(\bY = \{Y_1 , \ldots ,Y_n\}\), we may write
\[\PP(\bX \mid \bY) = \PP(\bX)\] as shorthand for
\[\begin{align} \forall x_1 \ldots \forall x_m \forall y_1 \ldots \forall y_n & [\PP(X_1 =x_1 ,\ldots ,X_m =x_m \mid Y_1 =y_1 ,\ldots ,Y_n =y_n)\\ & = \PP(X_1 =x_1 ,\ldots ,X_m =x_m)]\end{align}\] (where the domain of quantification for each variable will be the range of the relevant random variable).
Causal relations are normally thought to be objective features of the world. If they are to be captured in terms of probability theory, then probability assignments should represent some objective feature of the world. There are a number of attempts to interpret probabilities objectively, the most prominent being frequency interpretations and propensity interpretations. Most proponents of probabilistic theories of causation have understood probabilities in one of these two ways. Notable exceptions are Suppes (1970), who takes probability to be a feature of a model of a scientific theory; and Skyrms (1980), who understands the relevant probabilities to be the subjective probabilities of a certain kind of rational agent.
It is common to distinguish between general, or type-level causation, on the one hand, and singular, token-level or actual causation, on the other. This entry adopts the terms general causation and actual causation. Causal claims usually have the structure ‘C causes E’. C and E are the relata of the causal claim; we will discuss causal relata in more detail in the next section. General causation and actual causation are often distinguished by their relata. General causal claims, such as “smoking causes lung cancer”, typically do not refer to particular individuals, places, or times, but only to event-types or properties. Singular causal claims, such as “Jill’s heavy smoking during the 2000s caused her to develop lung cancer”, typically do make reference to particular individuals, places, and times. This is an imperfect guide, however; for example, some theories of general causation to be discussed below take their causal relata to be time-indexed.
A related distinction is that general causation is concerned with a full range of possibilities, whereas actual causation is concerned with how events actually play out in a specific case. At a minimum, in claims of actual causation, “cause” functions as a success verb. The claim “Jill’s heavy smoking during the 2000s caused her to develop lung cancer” implies that Jill smoked heavily during the 2000s and that she developed lung cancer.
The theories to be discussed in Sections 2 and 3 below primarily concern general causation, while Section 4 discusses theories of actual causation.
A number of different candidates have been proposed for the relata of causal relations. The relata of actual causal relations are often taken to be events (not to be confused with events in the purely technical sense), although some authors (e.g., Mellor 2004) argue that they are facts. The relata of general causal relations are often taken to be properties or event-types. For purposes of definiteness, events will refer to the relata of actual causation, and factors will refer to the relata of general causation. These terms are not intended to imply a commitment to any particular view on the nature of the causal relata.
In probabilistic approaches to causation, causal relata are represented by events or random variables in a probability space. Since the formalism requires us to make use of negation, conjunction, and disjunction, the relata must be entities (or be accurately represented by entities) to which these operations can be meaningfully applied.
In some theories, the time at which an event occurs or a property is instantiated plays an important role. In such cases, it will be useful to include a subscript indicating the relevant time. Thus the relata might be represented by \(C_t\) and \(E_{t'}\). If the relata are particular events, this subscript is just a reminder; it adds no further information. For example, if the event in question is the opening ceremony of the Rio Olympic games, the subscript ‘8/5/2016’ is not necessary to disambiguate it from other events. In the case of properties or event-types, however, such subscripts do add further information. The time index need not refer to a date or absolute time. It could refer to a stage in the development of a particular kind of system. For example, exposure to lead paint in children can cause learning disabilities. Here the time index would indicate that it is exposure in children, that is, in the early stages of human life, that causes the effect in question. The time indices may also indicate relative times. Exposure to the measles virus causes the appearance of a rash approximately two weeks later. We might indicate this time delay by assigning exposure a time index of \(t = 0\), and rash an index of \(t = 14\) (for 14 days).
It is standard to assume that causes and effects must be distinct from one another. This means that they must not stand in logical relations or part-whole relations to one another. Lewis 1986a contains a detailed discussion of the relevant notion of distinctness. We will typically leave this restriction tacit.
Psillos 2009 provides an overview of regularity theories of causation. Lewis 1973 contains a brief but clear and forceful overview of problems with regularity theories. The entry for scientific explanation contains discussions of some of these problems.
Hájek and Hitchcock 2016b is a short introduction to probability theory geared toward philosophical applications. Billingsley 1995 and Feller 1968 are two standard texts on probability theory. The entry for interpretations of probability includes a brief introduction to the formalism of probability theory, and discusses the various interpretations of probability. Galavotti 2005 and Gillies 2000 are good surveys of philosophical theories of probability. Hájek and Hitchcock 2016a includes essays covering the major interpretations of probability.
The Introduction of Eells 1991 provides a good overview of the distinction between general and actual causation.
Bennett 1988 is an excellent discussion of facts and events in the context of causation. Ehring 2009 is a survey of views about causal relata. See also the entries for the metaphysics of causation, events, facts, and properties.
The theories canvassed in this section all develop the basic idea that causes raise the probability of their effects. These theories were among the leading theories of causation during the second half of the 20th century. Today, they have largely been supplanted by the causal modeling approaches discussed in Section 3.
The central idea that causes raise the probability of their effects can be expressed formally using conditional probability. C raises the probability of E just in case:
In words, the probability that E occurs, given that C occurs, is higher than the unconditional probability that E occurs. Alternately, we might say that C raises the probability of E just in case:
the probability that E occurs, given that C occurs, is higher than the probability that E occurs, given that C does not occur. These two formulations turn out to be equivalent in the sense that inequality \(\PR_1\) will hold just in case \(\PR_2\) holds. Some authors (e.g., Reichenbach 1956, Suppes 1970, Cartwright 1979) have formulated probabilistic theories of causation using inequalities like \(\PR_1\), others (e.g., Skyrms 1980, Eells 1991) have used inequalities like \(\PR_2\). This difference is mostly immaterial, but for consistency we will stick with (\(\PR_2\)). Thus a first stab at a probabilistic theory of causation would be:
PR has some advantages over the simplest version of a regularity theory of causation (discussed in Section 1.1 above). PR is compatible with imperfect regularities: C may raise the probability of E even though instances of C are not invariably followed by instances of E. Moreover, PR addresses the problem of relevance: if C is a cause of E, then C makes a difference for the probability of E. But as it stands, PR does not address either the problem of asymmetry, or the problem of spurious correlations. PR does not address the problem of asymmetry because probability-raising turns out to be symmetric: \(\PP(E \mid C) \gt \PP(E \mid {\nsim}C)\) if and only if \(\PP(C \mid E) \gt \PP(C \mid {\nsim}E)\). Thus PR by itself cannot determine whether C is the cause of E or vice versa. PR also has trouble with spurious correlations. If C and E are both caused by some third factor, A, then it may be that \(\PP(E \mid C) \gt \PP(E \mid {\nsim}C)\) even though C does not cause E. This is the situation shown in Figure 1 above. Here, C is the drop in the level of mercury in a barometer, and E is the occurrence of a storm. Then we would expect that \(\PP(E \mid C) \gt \PP(E \mid {\nsim}C)\). In this case, atmospheric pressure is referred to as a confounding factor.
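The symmetry of probability-raising can be checked numerically. The sketch below (the sampling scheme and names are my own, not the entry's) draws random joint distributions over two binary variables and confirms that C raises the probability of E exactly when E raises the probability of C:

```python
import random

random.seed(1)

def random_joint():
    """A random joint distribution over binary C and E, stored as
    probabilities of the four conjunctions, keyed by (C-value, E-value)."""
    w = [random.random() for _ in range(4)]
    s = sum(w)
    keys = [(True, True), (True, False), (False, True), (False, False)]
    return dict(zip(keys, [x / s for x in w]))

def raises(joint, target, given):
    """Does the `given` variable raise the probability of the `target`
    variable?  Indices: 0 for C, 1 for E."""
    def cond(t_val, g_val):
        pg = sum(p for k, p in joint.items() if k[given] == g_val)
        ptg = sum(p for k, p in joint.items()
                  if k[given] == g_val and k[target] == t_val)
        return ptg / pg
    return cond(True, True) > cond(True, False)

# In every random distribution, "C raises E" and "E raises C" agree.
symmetric = all(
    raises(j, target=1, given=0) == raises(j, target=0, given=1)
    for j in (random_joint() for _ in range(1000))
)
print(symmetric)  # True
```

Both directions reduce to the same inequality \(\PP(C \amp E) \gt \PP(C)\PP(E)\), which is why the agreement is exact rather than approximate.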
Hans Reichenbach’s The Direction of Time was published posthumously in 1956. In it, Reichenbach is concerned with the origins of temporally asymmetric phenomena, particularly the increase in entropy dictated by the second law of thermodynamics. In this work, he presents the first fully developed probabilistic theory of causation, although some of the ideas can be traced back to an earlier paper from 1925 (Reichenbach 1925).
Reichenbach introduced the terminology of screening off to describe a particular type of probabilistic relationship. If \(\PP(E \mid A \amp C) = \PP(E \mid C)\), then C is said to screen A off from E. When \(\PP(A \amp C) \gt 0\), this equality is equivalent to \(\PP(A \amp E \mid C) = \PP(A \mid C) \times \PP(E \mid C)\); i.e., A and E are probabilistically independent conditional upon C.
Reichenbach recognized that there were two kinds of causal structure in which C will typically screen A off from E. The first occurs when A causes C, which in turn causes E, and there is no other route or process by which A affects E. This is shown in Figure 2.
Figure 2
In this case, Reichenbach said that C is causally between A and E. We might say that C is an intermediate cause between A and E, or that C is a proximate cause of E and A a distal cause of E. For example, unprotected sex (A) causes AIDS (E) only by causing HIV infection (C). Then we would expect that among those already infected with HIV, those who became infected through unprotected sex would be no more likely to contract AIDS than those who became infected in some other way.
The second type of case that produces screening off occurs when C is a common cause of A and E, such as in the barometer example depicted in Figure 1. A drop in atmospheric pressure (C) causes both a drop in the level of mercury in a barometer (A) and a storm (E). (This notation is slightly different from one used earlier.) The atmospheric pressure will screen off the barometer reading from the weather: given that the atmospheric pressure has dropped, the reading of the barometer makes no difference for the probability of whether a storm will occur.
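A small numerical model of the barometer case may help; the specific probabilities below are invented for illustration. A and E are generated independently given the common cause C, so C both induces a spurious correlation between them and screens it off:

```python
# Hypothetical numbers.  C: pressure drop; A: barometer drop; E: storm.
p_c = 0.3
p_a_given = {True: 0.95, False: 0.05}   # P(A | C), P(A | ~C)
p_e_given = {True: 0.80, False: 0.10}   # P(E | C), P(E | ~C)

# Build the joint distribution, with A and E independent conditional on C.
joint = {}
for c in (True, False):
    pc = p_c if c else 1 - p_c
    for a in (True, False):
        pa = p_a_given[c] if a else 1 - p_a_given[c]
        for e in (True, False):
            pe = p_e_given[c] if e else 1 - p_e_given[c]
            joint[(c, a, e)] = pc * pa * pe

def prob(pred):
    return sum(p for outcome, p in joint.items() if pred(*outcome))

def cond(pred, given):
    return prob(lambda c, a, e: pred(c, a, e) and given(c, a, e)) / prob(given)

# A raises the probability of E overall (the spurious correlation) ...
p_e_a = cond(lambda c, a, e: e, lambda c, a, e: a)
p_e_not_a = cond(lambda c, a, e: e, lambda c, a, e: not a)
print(p_e_a > p_e_not_a)  # True

# ... but C screens A off from E: P(E | A & C) = P(E | C).
p_e_ac = cond(lambda c, a, e: e, lambda c, a, e: a and c)
p_e_c = cond(lambda c, a, e: e, lambda c, a, e: c)
print(abs(p_e_ac - p_e_c) < 1e-12)  # True
```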
Reichenbach used the apparatus of screening off to address the problem of spurious correlations. In our example, while a drop in the column of mercury (A) raises the probability of a storm (E) overall, it does not raise the probability of a storm when we further condition on the atmospheric pressure. That is, if A and E are spuriously correlated, then A will be screened off from E by a common cause. More specifically, suppose that \(C_t\) and \(E_{t'}\) are events that occur at times t and \(t'\) respectively. Then
Note the restriction of \(t''\) to times earlier than or simultaneous with the occurrence of \(C_t\). That is because causal intermediates between \(C_t\) and \(E_{t'}\) will often screen \(C_t\) off from \(E_{t'}\). In such cases we still want to say that \(C_t\) is a cause of \(E_{t'}\), albeit a distal or indirect cause.
Suppes (1970) independently offered an equivalent definition of causation, although his motivation for the no-screening-off condition was different from Reichenbach’s. Suppes extended the framework in a number of directions. While Reichenbach was interested in probabilistic causation primarily in connection with issues that arise within the foundations of statistical mechanics, Suppes was interested in defining causation within the framework of probabilistic models of scientific theories. For example, Suppes offers an extended discussion of causation in the context of psychological models of learning.
Reichenbach (1956) formulated a principle he dubbed the ‘Common Cause Principle’ (CCP). Suppose that events A and B are positively correlated, i.e., that
But suppose that neither A nor B is a cause of the other. Then Reichenbach maintained that there will be a common cause, C, of A and B, satisfying the following conditions:
When events A, B, and C satisfy these conditions, they are said to form a conjunctive fork. Conditions 5 and 6 follow from C being a cause of A and a cause of B. Conditions 2 and 3 stipulate that C and \({\nsim}C\) screen off A from B.
Conditions 2 through 6 mathematically entail condition 1. Reichenbach says that the common cause explains the correlation between A and B. The idea is that probabilistic correlations that are not the result of one event causing another are ultimately derived from probabilistic correlations that do result from a causal relationship.
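The entailment can be spot-checked numerically. In the sketch below (my own construction), a fork model is specified by \(\PP(C)\) and the conditional probabilities of A and B given C and \({\nsim}C\), with A and B independent conditional on each value of C (the screening-off conditions) and with C raising the probability of each effect; every such model then exhibits the positive correlation of condition 1:

```python
import random

random.seed(2)

def fork_correlation(pc, pa_c, pa_nc, pb_c, pb_nc):
    """P(A & B) - P(A)P(B) for a conjunctive-fork model in which A and B
    are independent conditional on C and conditional on ~C."""
    p_ab = pc * pa_c * pb_c + (1 - pc) * pa_nc * pb_nc
    p_a = pc * pa_c + (1 - pc) * pa_nc
    p_b = pc * pb_c + (1 - pc) * pb_nc
    return p_ab - p_a * p_b

violations = 0
for _ in range(1000):
    pc = random.uniform(0.01, 0.99)
    # C raises the probability of A and of B (conditions 5 and 6).
    pa_c, pa_nc = sorted((random.uniform(0.01, 0.99) for _ in range(2)),
                         reverse=True)
    pb_c, pb_nc = sorted((random.uniform(0.01, 0.99) for _ in range(2)),
                         reverse=True)
    if fork_correlation(pc, pa_c, pa_nc, pb_c, pb_nc) <= 0:
        violations += 1

print(violations)  # 0: the fork conditions always yield condition 1
```

Algebraically the correlation equals \(\PP(C)(1-\PP(C))\,[\PP(A \mid C)-\PP(A \mid {\nsim}C)]\,[\PP(B \mid C)-\PP(B \mid {\nsim}C)]\), which is positive whenever C raises the probability of both effects.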
Reichenbach’s definition of causation, discussed in Section 2.2 above, appeals to time order: it requires that a cause occur earlier than its effect. But Reichenbach also thought that the direction from causes to effects can be identified with a pervasive statistical asymmetry. Suppose that events A and B are correlated, and that C satisfies conditions 2–6 above, so that ACB forms a conjunctive fork. If C occurs earlier than A and B, and there is no event satisfying 2 through 6 that occurs later than A and B, then ACB is said to form a conjunctive fork open to the future. Analogously, if there is a later event satisfying 2 through 6, but no earlier event, we have a conjunctive fork open to the past. If an earlier event C and a later event D both satisfy 2 through 6, then ACBD forms a closed fork. Reichenbach’s proposal was that the direction from cause to effect is the direction in which open forks predominate. In our world, there are a great many forks open to the future, few or none open to the past. However, we shall see in section 3.6 below that conjunctive forks are not the best structures for identifying causal direction.
In the Reichenbach-Suppes definition of causation, the inequality \(\PP(E_{t'} \mid C_t) \gt \PP(E_{t'} \mid {\nsim}C_t)\) is necessary, but not sufficient, for causation. It is not sufficient, because it may hold in cases where \(C_t\) and \(E_{t'}\) share a common cause. Unfortunately, common causes can also give rise to cases where this inequality is not necessary for causation either. Suppose, for example, that smoking is highly correlated with living in the country: those who live in the country are much more likely to smoke as well. Smoking is a cause of lung cancer, but suppose that city pollution is an even stronger cause of lung cancer. Then it may be that smokers are, over all, less likely to suffer from lung cancer than non-smokers. Letting C represent smoking, B living in the country, and E lung cancer, \(\PP(E \mid C) \lt \PP(E \mid {\nsim}C)\). Note, however, that if we conditionalize on whether one lives in the country or in the city, this inequality is reversed: \(\PP(E \mid C \amp B) \gt \PP(E \mid {\nsim}C \amp B)\), and \(\PP(E \mid C \amp{\nsim}B) \gt \PP(E \mid {\nsim}C \amp{\nsim}B)\). Such reversals of probabilistic inequalities are instances of “Simpson’s Paradox”. The problem that Simpson’s paradox creates for probabilistic theories of causation was pointed out by Nancy Cartwright (1979) and Brian Skyrms (1980) at about the same time.
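The reversal can be reproduced with hypothetical numbers (the figures below are invented for illustration, not data): smoking raises the probability of lung cancer among country dwellers and among city dwellers alike, yet lowers it in the aggregate, because smokers are concentrated in the low-risk countryside:

```python
# C: smoking, B: country living, E: lung cancer.  All numbers hypothetical.
p_b_given_c = {True: 0.9, False: 0.1}   # smokers mostly live in the country

# P(E | C, B): cancer risk by smoking status and location.
p_e = {(True, True): 0.15, (False, True): 0.05,     # country
       (True, False): 0.40, (False, False): 0.30}   # polluted city

def p_e_given(c):
    """P(E | C = c), averaging over where such people live."""
    pb = p_b_given_c[c]
    return pb * p_e[(c, True)] + (1 - pb) * p_e[(c, False)]

# Smoking raises the probability of cancer in each background context ...
assert p_e[(True, True)] > p_e[(False, True)]    # holding country living fixed
assert p_e[(True, False)] > p_e[(False, False)]  # holding city living fixed

# ... yet lowers it overall: Simpson's paradox.
print(p_e_given(True), p_e_given(False))  # roughly 0.175 vs 0.275
```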
Cartwright and Skyrms sought to rectify the problem by replacing conditions (ii) and (iii) of Reich with the requirement that causes must raise the probabilities of their effects in various background contexts. Cartwright proposed the following definition:
Skyrms proposed a slightly weaker condition: a cause must raise the probability of its effect in at least one background context, and lower it in none. A background context is a conjunction of factors. When such a conjunction of factors is conditioned on, those factors are said to be “held fixed”. To specify what the background contexts will be, then, we must specify what factors are to be held fixed. In the previous example, we saw that the true causal relevance of smoking for lung cancer was revealed when we held country living fixed, either positively (conditioning on \(B\)) or negatively (conditioning on \({\nsim}B\)). This suggests that in evaluating the causal relevance of C for E, we need to hold fixed other causes of E, either positively or negatively. This suggestion is not entirely correct, however. Let C and E be smoking and lung cancer, respectively. Suppose D is a causal intermediary, say the presence of tar in the lungs. If C causes E exclusively via D, then D will screen C off from E: given the presence (absence) of tar in the lungs, the probability of lung cancer is not affected by whether the tar got there by smoking. Thus we will not want to hold fixed any causes of E that are themselves caused by C. Let us call the set of all factors that are causes of E, but are not caused by C, the set of independent causes of E. A background context for C and E will then be a maximal conjunction, each of whose conjuncts is either an independent cause of E, or the negation of an independent cause of E.
Note that the specification of factors that need to be held fixed appeals to causal relations, so the theory no longer offers a reductive analysis of causation. Nonetheless, the theory imposes probabilistic constraints upon possible causal relations in the sense that a given set of probability relations will be incompatible with at least some systems of causal relations. Note also that we have dropped the subscripts referring to times. Cartwright claimed that it is not necessary to appeal to the time order of events to distinguish causes from effects in her theory. That is because it will no longer be true in general that if C raises the probability of E in every relevant background context B, then E will raise the probability of C in every background context \(B'\). The reason is that the construction of the background contexts ensures that the background contexts relevant to assessing C’s causal relevance for E are different from those relevant to assessing E’s causal relevance for C. However, Davis (1988) and Eells (1991) both argue cogently that Cartwright’s account will still sometimes rule that effects bring about their causes.
Cartwright defined a cause as a factor that increases the probability of its effect in every background context. But it is easy to see that there are other possible probability relations between C and E. Eells (1991) proposes the following taxonomy:
\(C_t\) is causally relevant for \(E_{t'}\) if and only if it is a positive, negative, or mixed cause of \(E_{t'}\); i.e., if and only if \(t \lt t'\) and \(C_t\) is not causally neutral for \(E_{t'}\).
It should be apparent that when constructing background contexts for C and E one should hold fixed not only (positive) causes of E that are independent of \(C\), but also negative and mixed causes of E; in other words, one should hold fixed all factors that are causally relevant for E, except those for which C is causally relevant. This suggests that causal relevance, rather than positive causation, is the most basic metaphysical concept.
Eells’s taxonomy brings out an important distinction. It is one thing to ask whether C is causally relevant to E in some way; it is another to ask in which way C is causally relevant to E. To say that C causes E is then potentially ambiguous: it might mean that C is causally relevant to E; or it might mean that C is a positive cause of E. Probabilistic theories of causation can be used to answer both types of question.
Eells claims that general causal claims must be relativized to a population. A very heterogeneous population will include a great many different background conditions, while a homogeneous population will contain few. A heterogeneous population can always be subdivided into homogeneous subpopulations. It will often happen that C is a mixed cause of E relative to a population P, while being a positive cause, negative cause, or causally neutral for E in various subpopulations of P.
According to both Cart and Eells, a cause must raise the probability of its effect in every background context. This has been called the requirement of contextual-unanimity. Dupré (1984) raises the following counterexample to the contextual unanimity requirement. Suppose that there is a very rare gene that has the following effect: those that possess the gene have their chances of contracting lung cancer lowered when they smoke. In this scenario, there would be a background context in which smoking lowers the probability of lung cancer: thus smoking would not be a cause of lung cancer according to the contextual-unanimity requirement. Nonetheless, it seems unlikely that the discovery of such a gene would lead us to abandon the claim that smoking causes lung cancer.
Dupré suggests instead that we should deem C to be a cause of E if it raises the probability of E in a ‘fair sample’—a sample that is representative of the population as a whole. Mathematically, this amounts to the requirement that
where B ranges over the relevant background contexts. This is the same as requiring that C must raise the probability of E in a weighted average of background contexts, where each background context is weighted by the product of \(\PP(B)\) and the absolute value of
\[\PP(E \mid C \amp B) - \PP(E \mid {\nsim}C \amp B).\] Dupré’s account surely comes closer to capturing our ordinary use of causal language. Indeed, the inequality in Dupré is what one looks for in randomized trials. If one randomly determines which members of a population receive a treatment (C) and which do not \(({\nsim}C)\), then the distribution of background conditions B ought to be the same in both groups, and ought to reflect the frequency of these conditions in the population. Thus we would expect the frequency of E to be higher in the treatment group just in case inequality Dupré holds.
On the other hand, Eells’s population-relative formulation allows us to make more precise causal claims: in the population as a whole, smoking is a mixed cause of lung cancer; in the sub-population of individuals who lack the protective gene, smoking is a positive cause of lung cancer; in the sub-population consisting of individuals who possess the gene, smoking is a negative cause of lung cancer.
In any event, this debate does not really seem to be about the metaphysics of causation. As we saw in the previous section, causal relevance is really the basic metaphysical concept. The dispute between Dupré and Eells is really a debate about how best to use the word ‘cause’ to pick out a particular species of causal relevance. Dupré’s proposed usage will count as (positive) causes many things that will be mixed causes in Eells’s proposed usage. But there does not seem to be any underlying disagreement about which factors are causally relevant. (For defense of a similar position, see Twardy and Korb 2004.)
The program described in this section did much to illuminate the relationship between causation and probability. In particular, it helped us to better understand the way in which causal structure can give rise to probabilistic relations of screening off. However, despite the mathematical framework of the program, and points of contact with statistics and experimental methodology, this program did not give rise to any new computational tools, or suggest any new methods for detecting causal relationships. For this reason, the program has largely been supplanted by the causal modeling tools described in the next section.
The main works surveyed in this section are Reichenbach 1956, Suppes 1970, Cartwright 1979, Skyrms 1980, and Eells 1991. Williamson 2009 and Hitchcock 2016 are two further surveys that cover a number of the topics discussed in this section. The entries for Hans Reichenbach and Reichenbach’s Common Cause Principle include discussions of Reichenbach’s program and the status of his Common Cause Principle. Salmon (1984) contains an extensive discussion of conjunctive forks. The entry for Simpson’s paradox contains further discussion of some of the issues raised in Section 2.4.
The discussion of the previous section conveys some of the complexity of the problem of inferring causal relationships from probabilistic correlations. Fairly recently, a number of techniques have been developed for representing systems of causal relationships, and for inferring causal relationships from probabilities. The name ‘causal modeling’ is often used to describe the new interdisciplinary field devoted to the study of methods of causal inference. This field includes contributions from statistics, artificial intelligence, philosophy, econometrics, epidemiology, and other disciplines. Within this field, the research programs that have attracted the greatest philosophical interest are those of the computer scientist Judea Pearl and his collaborators, and of the philosophers Peter Spirtes, Clark Glymour, and Richard Scheines (SGS) and their collaborators. The most significant works of these authors are Pearl (2009) (first published in 2000), and Spirtes et al. (2000) (first published in 1993).
Every causal model involves a set of variables \(\bV\). The variables in \(\bV\) may include, for example, the education level, income, and occupation of an individual. A variable could be binary, its values representing the occurrence or non-occurrence of some event, or the instantiation or non-instantiation of some property. But as the example of income suggests, a variable could have multiple values or even be continuous.
A probabilistic causal model also includes a probability measure P. P is defined over propositions of the form \(X = x\), where X is a variable in \(\bV\) and x is a value in the range of X. P is also defined over conjunctions, disjunctions, and negations of such propositions. It follows that conditional probabilities over such propositions will be well-defined whenever the event conditioned on has positive probability. P is usually understood to represent some kind of objective probability.
Causal relationships among the variables in \(\bV\) are represented by graphs. We will consider two types of graphs. The first is the directed acyclic graph (DAG). A directed graph \(\bG\) on variable set \(\bV\) is a set of ordered pairs of variables in \(\bV\). We represent this visually by drawing an arrow from X to Y just in case \(\langle X, Y\rangle\) is in \(\bG\). Figure 3 shows a directed graph on variable set \(\bV = \{S, T, W, X, Y, Z\}\).
Figure 3
A path in a directed graph is a non-repeating sequence of arrows that have endpoints in common. For example, there is a path from X to Z, which we can write as \(X \leftarrow T \rightarrow Y \rightarrow Z\). A directed path is a path in which all the arrows align by meeting tip-to-tail; for example, there is a directed path \(S \rightarrow T \rightarrow Y \rightarrow Z\). A directed graph is acyclic, and hence a DAG, if there is no directed path from a variable to itself. The graph in Figure 3 is a DAG.
The relationships in the graph are often described using the language of genealogy. The variable X is a parent of Y just in case there is an arrow directed from X to Y. \(\PA(Y)\) will denote the set of all parents of Y. In Figure 3, \(\PA(Y) = \{T, W\}\). X is an ancestor of Y (and Y is a descendant of X) just in case there is a directed path from X to Y. However, it will be convenient to deviate slightly from the genealogical analogy and define ‘descendant’ so that every variable is also a descendant of itself. \(\DE(X)\) denotes the set of all descendants of X. In Figure 3, \(\DE(T) = \{T, X, Y, Z\}\).
An arrow from Y to Z in a DAG represents that Y is a direct cause of Z. Roughly, this means that the value of Y makes some causal difference for the value of Z, and that Y influences Z through some process that is not mediated by any other variable in \(\bV\). Directness is relative to a variable set. We will call the system of direct causal relations represented in a DAG such as Figure 3 the causal structure on the variable set \(\bV\).
A second type of graph that we will consider is the acyclic directed mixed graph (ADMG). An ADMG will contain double-headed arrows, as well as single-headed arrows. A double-headed arrow represents a latent common cause. A latent common cause of variables X and Y is a common cause that is not included in the variable set \(\bV\). For example, suppose that X and Y share a common cause L (Figure 4(a)). An ADMG on the variable set \(\bV = \{X, Y\}\) will look like Figure 4(b).
| (a) | (b) |
Figure 4
We only need to represent missing common causes in this way when they are closest common causes. That is, a graph on \(\bV\) should contain a double-headed arrow between X and Y when there is a variable L that is omitted from \(\bV\), such that if L were added to \(\bV\) it would be a direct cause of X and Y. Double-headed arrows do not give rise to “genealogical” relationships: in Figure 4(b), X is not a parent, ancestor, or descendant of Y.
In an ADMG, we expand the definition of a path to include double-headed arrows. Thus, \(X \leftrightarrow Y\) is a path in the ADMG shown in Figure 4(b). Directed path retains the same meaning, and a directed path cannot contain double-headed arrows. An ADMG cannot include a directed path from a variable to itself.
We will adopt the convention that both DAGs and ADMGs represent the presence and absence of both direct causal relationships and latent common causes. For example, the DAG in Figure 3 represents that T is a direct cause of Y, that T is not a direct cause of Z, and that there are no latent common causes of any variables.
We will be interested in a variety of problems that have a general structure. There will be a query concerning some causal feature of the system being investigated. A query may concern:
A given problem will also have a set of inputs. These fall into a variety of categories:
In realistic scientific cases, we never directly observe the true probability distribution P over a set of variables. Rather, we observe finite data that approximate the true probability when sample sizes are large enough and observation protocols are well-designed. Since our primary concern is with the philosophical issue of how probabilities determine or constrain causal structure, we will not address these important practical concerns. An answer to a query that can be determined from the true probabilities is said to be identifiable. For instance, if we can determine the correct DAG on a variable set \(\bV\) from the probability distribution on \(\bV\), the DAG is identifiable.
The most important principle connecting the causal structure on \(\bV\), as represented in a graph \(\bG\), and the probability distribution P on \(\bV\) is the Markov Condition (MC). Let us first consider the case where \(\bG\) is a DAG. Then P satisfies the Markov Condition (MC) relative to \(\bG\) if and only if it satisfies these three conditions:
| (MCScreening_off) | For every variable X in \(\bV\), and every set of variables \(\bY \subseteq \bV \setminus \DE(X)\), \(\PP(X \mid \PA(X) \amp \bY) = \PP(X \mid \PA(X))\). |
| (MCFactorization) | Let \(\bV = \{X_1, X_2 , \ldots ,X_n\}\). Then \(\PP(X_1, X_2 , \ldots ,X_n) = \prod_i \PP(X_i \mid \PA(X_i))\). |
| (MCd-separation) | Let \(X, Y \in \bV, \bZ \subseteq \bV \setminus \{X, Y\}\). Then \(\PP(X, Y \mid \bZ) = \PP(X \mid \bZ) \times \PP(Y \mid \bZ)\) if \(\bZ\) d-separates X and Y in \(\bG\) (explained below). |
These three conditions are equivalent when \(\bG\) is a DAG.
Let us take some time to explain each of these formulations.
MCScreening_off says that the parents of variable X screen X off from all other variables, except for the descendants of X. Given the values of the variables that are parents of X, the values of the variables in \(\bY\) (which includes no descendants of \(X\)) make no further difference to the probability that X will take on any given value.
MCFactorization tells us that once we know the conditional probability distribution of each variable given its parents, \(\PP(X_i \mid \PA(X_i))\), we can compute the complete joint distribution over all of the variables. This captures Reichenbach’s idea that probability relations between variables that are not related as cause and effect are nonetheless derived from probability relations between causes and effects.
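As an illustration of MCFactorization, the following sketch (with invented conditional probabilities) computes the joint distribution for a simple chain \(X \rightarrow Y \rightarrow Z\) as the product \(\PP(X)\PP(Y \mid X)\PP(Z \mid Y)\), and checks that the screening off required by MCScreening_off falls out:

```python
# Joint distribution for the chain X -> Y -> Z via MC_Factorization.
# All numerical probabilities are invented for the example.
from itertools import product

P_X = {0: 0.4, 1: 0.6}
P_Y_given_X = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}    # keys (y, x)
P_Z_given_Y = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.25, (1, 1): 0.75}  # keys (z, y)

joint = {}
for x, y, z in product([0, 1], repeat=3):
    joint[(x, y, z)] = P_X[x] * P_Y_given_X[(y, x)] * P_Z_given_Y[(z, y)]

# The factorized joint is a genuine probability distribution:
print(round(sum(joint.values()), 10))  # 1.0

# Screening off: P(Z=1 | X=x, Y=1) does not depend on x.
def cond_z(x, y):
    return joint[(x, y, 1)] / (joint[(x, y, 0)] + joint[(x, y, 1)])

print(round(cond_z(0, 1), 6), round(cond_z(1, 1), 6))  # 0.75 0.75
```
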
MCd-separation uses the graphical notion of d-separation, introduced by Pearl (1988). Let \(X, Y \in \bV, \bZ \subseteq \bV \setminus \{X, Y\}\). As noted above, a path from X to Y is a sequence of variables \(\langle X = X_1 , \ldots ,X_k = Y\rangle\) such that for each \(X_i\), \(X_{i+1}\), there is either an arrow from \(X_i\) to \(X_{i+1}\) or an arrow from \(X_{i+1}\) to \(X_i\) in \(\bG\). A variable \(X_i , 1 \lt i \lt k\) is a collider on the path just in case there is an arrow from \(X_{i-1}\) to \(X_i\) and from \(X_{i+1}\) to \(X_i\). That is, \(X_i\) is a collider on a path just in case two arrows converge on \(X_i\) in the path. \(\bZ\) d-separates X and Y just in case every path \(\langle X = X_1 , \ldots ,X_k = Y\rangle\) from X to Y contains at least one variable \(X_i\) such that either: (i) \(X_i\) is a collider, and no descendant of \(X_i\) (including \(X_i\) itself) is in \(\bZ\); or (ii) \(X_i\) is not a collider, and \(X_i\) is in \(\bZ\). MCd-separation states that d-separation is sufficient for conditional independence.
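The d-separation test just defined can be implemented directly. The sketch below (illustrative code, written for clarity rather than efficiency) enumerates the non-repeating paths between two variables and checks blocking conditions (i) and (ii):

```python
# A compact d-separation test for DAGs, following the definition in the text:
# Z d-separates X and Y iff every path contains a non-collider in Z, or a
# collider none of whose descendants (itself included) is in Z.
def descendants(G, x):
    seen, frontier = {x}, [x]
    while frontier:
        a = frontier.pop()
        for (p, q) in G:
            if p == a and q not in seen:
                seen.add(q)
                frontier.append(q)
    return seen

def paths(G, x, y, path=None):
    """All non-repeating paths from x to y in the skeleton of G."""
    path = path or [x]
    if x == y:
        yield path
        return
    nbrs = {q for (p, q) in G if p == x} | {p for (p, q) in G if q == x}
    for n in nbrs - set(path):
        yield from paths(G, n, y, path + [n])

def d_separated(G, x, y, Z):
    Z = set(Z)
    for path in paths(G, x, y):
        blocked = False
        for i in range(1, len(path) - 1):
            a, b, c = path[i - 1], path[i], path[i + 1]
            collider = (a, b) in G and (c, b) in G
            if collider and not (descendants(G, b) & Z):
                blocked = True      # condition (i)
            if not collider and b in Z:
                blocked = True      # condition (ii)
        if not blocked:
            return False
    return True

chain = {("X", "Y"), ("Y", "Z")}
collider = {("X", "Y"), ("Z", "Y")}
print(d_separated(chain, "X", "Z", set()))     # False: unconditionally dependent
print(d_separated(chain, "X", "Z", {"Y"}))     # True: Y screens off
print(d_separated(collider, "X", "Z", set()))  # True: colliders block paths
print(d_separated(collider, "X", "Z", {"Y"}))  # False: conditioning on a collider unblocks
```
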
Note that MC provides sufficient conditions for variables to be probabilistically independent, conditional on others, but no necessary conditions. The Markov Condition entails many of the same screening off relations as Reichenbach’s Common Cause Principle, discussed in Section 2.3 above. Here are some examples:
Figure 5
In Figure 5, MC implies that X screens Y off from all of the other variables, and that W screens Z off from all of the other variables. This is most easily seen from MCScreening_off. W also screens T off from all of the other variables, which is most easily seen from MCd-separation. MC does not imply that T screens Y off from Z (or indeed anything from anything). While Y and Z do have a common cause that screens them off (W), not all common causes screen them off (T does not have to), and not everything that screens them off is a common cause (X screens them off but is not a common cause).
Figure 6
In Figure 6, MC entails that X and Y will be unconditionally independent, but not that they will be independent conditional on Z. This is most easily seen from MCd-separation.
MC is not expected to hold for arbitrary sets of variables \(\bV\), even when the graph \(\bG\) accurately represents the causal relations among those variables. For example, MC will typically fail in the following kinds of case:
If there are latent common causes, we expect MCScreening_off and MCFactorization to fail if we apply them in a naïve way. For example, suppose that the true causal structure on \(\bV = \{X, Y, Z\}\) is shown by the ADMG in Figure 7.
Figure 7
Y is the only parent of Z shown in the graph, and if we try to apply MCScreening_off, it tells us that Y should screen X off from Z. However, we would expect X and Z to be correlated, even when we condition on Y, due to the latent common cause. The problem is that the graph is missing a relevant parent of Z, namely the omitted common cause. However, suppose that the probability distribution is such that if the latent cause L were added, the probability distribution over the expanded set of variables would satisfy MC with respect to the resulting DAG. Then it turns out that the probability distribution will still satisfy MCd-separation with respect to the ADMG of Figure 7. This requires us to expand the definition of d-separation to include paths with double-headed arrows. For instance, Z is a collider on the path \(Y \rightarrow Z \leftrightarrow X\) (since Z has two arrows pointing into it), but X is not a collider on the path \(Y \leftarrow X \leftrightarrow Z\). Thus we will say that a probability distribution P satisfies the Markov Condition relative to an ADMG just in case it satisfies MCd-separation.
Both SGS 2000 and Pearl 2009 contain statements of a principle called the Causal Markov Condition (CMC), but they mean different things. In Pearl’s formulation, CMC is just a statement of a mathematical theorem: Pearl and Verma (1991) prove that if each variable in \(\bV\) is a deterministic product of its parents in \(\bV\), together with an error term, and the errors are probabilistically independent of each other, then the probability distribution on \(\bV\) will satisfy MC with respect to the DAG \(\bG\). Pearl interprets this result in the following way: Macroscopic systems, he believes, are deterministic. In practice, however, we never have access to all of the causally relevant variables affecting a macroscopic system. But if we include enough variables in our model so that the excluded variables are probabilistically independent of one another, then our model will satisfy the MC, and we will have a powerful set of analytic tools for studying the system. Thus MC characterizes a point at which we have constructed a useful approximation of the complete system.
In SGS 2000, the CMC has more the status of an empirical posit. If \(\bV\) is a set of macroscopic variables that are well-chosen, meaning that they are free from the sorts of defects described in points (ii) and (iii) above; \(\bG\) is a graph representing the causal structure on \(\bV\); and P is the objective probability distribution resulting from this causal structure; then P can be expected to satisfy MC relative to \(\bG\). More precisely, P will satisfy all three versions of MC if \(\bG\) is a directed acyclic graph, and P will satisfy MCd-separation if \(\bG\) is an ADMG with double-headed arrows. SGS defend this empirical posit in two different ways:
Cartwright (1993, 2007: chapter 8) has argued that MC need not hold for genuinely indeterministic systems. Hausman and Woodward (1999, 2004) attempt to defend MC for indeterministic systems.
A causal model that comprises a DAG and a probability distribution that satisfies MC is called a causal Bayes net (CBN). A causal model incorporating an ADMG and a probability distribution satisfying MCd-separation is called a semi-Markov causal model (SMCM).
The MC states a sufficient condition but not a necessary condition for conditional probabilistic independence. As such, the MC by itself can never entail that two variables are conditionally or unconditionally dependent. The Minimality and Faithfulness Conditions are two principles that posit necessary conditions for probabilistic independence. The terminology comes from Spirtes et al. (2000). Pearl provides analogous conditions with different terminology.
(i) The Minimality Condition. Suppose that the acyclic directed graph \(\bG\) on variable set \(\bV\) satisfies MC with respect to the probability distribution P. The Minimality Condition asserts that no sub-graph of \(\bG\) over \(\bV\) also satisfies the Markov Condition with respect to P. (A subgraph of \(\bG\) is a graph over \(\bV\) that results from removing arrows from \(\bG\).) As an illustration, consider the variable set \(\{X, Y\}\), let there be an arrow from X to Y, and suppose that X and Y are probabilistically independent of each other according to probability function P. This graph would satisfy the MC with respect to P: none of the independence relations mandated by the MC are absent (in fact, the MC mandates no independence relations). But this graph would violate the Minimality Condition with respect to P, since the subgraph that omits the arrow from X to Y would also satisfy the MC. The Minimality Condition implies that if there is an arrow from X to Y, then X makes a probabilistic difference for Y, conditional on the other parents of Y. In other words, if \(\bZ = \PA(Y) \setminus \{X\}\), there exist \(\bz\), y, x, \(x'\) such that
\[\PP(Y = y \mid X = x \amp \bZ = \bz) \ne \PP(Y = y \mid X = x' \amp \bZ = \bz).\]
(ii) The Faithfulness Condition. The Faithfulness Condition says that all of the (conditional and unconditional) probabilistic independencies that exist among the variables in \(\bV\) are required by the MC. For example, suppose that \(\bV = \{X, Y, Z\}\). Suppose also that X and Y are unconditionally independent of one another, but dependent, conditional upon Z. (The other two variable pairs are dependent, both conditionally and unconditionally.) The graph shown in Figure 8 does not satisfy the Faithfulness Condition with respect to this distribution (colloquially, the graph is not faithful to the distribution). MC, when applied to the graph of Figure 8, does not imply the independence of X and Y. By contrast, the graph shown in Figure 6 above is faithful to the described distribution. Note that Figure 8 does satisfy the Minimality Condition with respect to the distribution; no subgraph satisfies MC with respect to the described distribution. In fact, the Faithfulness Condition is strictly stronger than the Minimality Condition.
Figure 8
The Faithfulness Condition implies that the causal influences of one variable on another along multiple causal routes do not ‘cancel’. In Figure 8, X influences Y along two different directed paths. If the effect of one path is to exactly undo the influence along the other path, then X and Y will be probabilistically independent. The Faithfulness Condition forbids such exact cancellation. This ‘no canceling’ condition seems implausible as a metaphysical or conceptual constraint upon the connection between causation and probabilities. For example, if one gene codes for the production of a particular protein, and suppresses another gene that codes for the same protein, the operation of the first gene will be independent of the presence of the protein. Cartwright (2007: chapter 6) and Andersen (2013) argue that violations of faithfulness are widespread.
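The possibility of cancellation is easy to exhibit numerically. In the following sketch (a linear-Gaussian model invented for the illustration), X affects Y directly and via Z, with the two coefficients chosen so that the routes exactly cancel:

```python
# A faithfulness violation by exact cancellation: X -> Z -> Y with weight +1,
# and X -> Y with weight -1, so Y = -x + (x + noise) + noise carries no
# information about X. The model and its coefficients are invented.
import random
random.seed(0)

n = 200_000
xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    z = x + random.gauss(0, 1)                    # X -> Z
    y = -1.0 * x + 1.0 * z + random.gauss(0, 1)   # the two routes cancel
    xs.append(x)
    ys.append(y)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n
print(abs(cov) < 0.02)  # True: X and Y look independent despite two causal paths
```
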
The Faithfulness Condition is a methodological principle rather than a metaphysical principle. Given a distribution on \(\{X, Y, Z\}\) in which X and Y are independent, we should infer that the causal structure is that depicted in Figure 6, rather than Figure 8. This is not because Figure 8 is conclusively ruled out by the distribution, but rather because it is preferable to postulate a causal structure that implies the independence of X and Y rather than one that is merely consistent with independence.
The original hope of Reichenbach and Suppes was to provide a reduction of causation to probabilities. To what extent has this hope been realized within the causal modeling framework? Causal modeling does not offer a reduction in the traditional philosophical sense; that is, it does not offer an analysis of the form ‘X causes Y if and only if…’ where the right hand side of the biconditional makes no reference to causation. Instead, it offers a series of postulates about how causal structure constrains the values of probabilities. Still, if we have a set of variables \(\bV\) and a probability distribution P on \(\bV\), we may ask if P suffices to pick out a unique causal graph \(\bG\) on \(\bV\).
Pearl (1988: Chapter 3) proves the following theorem:
If
then it will be possible to uniquely identify \(\bG\) on the basis of P.
In many ways, this result successfully executes the sort of project described in Section 2 above. That is, making the same sorts of assumptions about time-indexing, and substantive assumptions about the connection between probability and causation, it establishes that it is possible to identify causal structure using probabilities.
If we don’t have information about time ordering, or other substantive assumptions restricting the possible causal structures among the variables in \(\bV\), then it will not always be possible to identify the causal structure from probability alone. In general, given a probability distribution P on \(\bV\), it is only possible to identify a Markov equivalence class of causal structures. This will be the set of all DAGs on \(\bV\) that (together with MC) imply all and only the conditional independence relations contained in P. The PC algorithm (SGS 2000: 84–85), named for its two creators (Peter Spirtes and Clark Glymour), is one algorithm that generates the Markov equivalence class for any given probability distribution.
Consider two simple examples involving three variables \(\{X, Y, Z\}\). Suppose our probability distribution has the following properties:
Then the Markov equivalence class is:
\[\begin{align}X \rightarrow Y \rightarrow Z\\X \leftarrow Y \leftarrow Z\\X \leftarrow Y \rightarrow Z\end{align}\]
We cannot determine from the probability distribution, together with MC and Faithfulness, which of these structures is correct.
On the other hand, suppose the probability distribution is as follows:
Then the Markov equivalence class is:
\[X \rightarrow Y \leftarrow Z\]
Note that the first probability distribution on \(\{X, Y, Z\}\) is that characterized by Reichenbach’s Common Cause Principle. The second distribution reverses the relations between X and Z: they are unconditionally independent and conditionally dependent. Contrary to Reichenbach, it is actually the latter pattern of dependence relations that is most useful for orienting the causal arrows in the graph. In the last causal structure shown, Y is a collider on the path from X to Z. MCd-separation implies that colliders give rise to distinctive conditional independence relations, while all three types of non-collider give rise to the same conditional independence relations. Many of the algorithms that have been developed for inferring causal structure from probabilities work by searching for colliders (see, e.g., SGS 2000: Chapter 5).
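The distinctive probabilistic signature of a collider can be seen in a small worked example (all numbers invented): with \(X \rightarrow Y \leftarrow Z\), X and Z are independent by construction, yet conditioning on the collider Y makes them dependent. Learning that one cause occurred ‘explains away’ the other:

```python
# A worked collider example, X -> Y <- Z, with invented probabilities in
# which Y tends to occur when either parent does.
from itertools import product

P_X = {0: 0.5, 1: 0.5}
P_Z = {0: 0.5, 1: 0.5}
P_Y1 = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}  # P(Y=1 | x, z)

joint = {}
for x, z, y in product([0, 1], repeat=3):
    py1 = P_Y1[(x, z)]
    joint[(x, z, y)] = P_X[x] * P_Z[z] * (py1 if y == 1 else 1 - py1)

def p_x1_given(y, z):
    """P(X=1 | Y=y, Z=z), computed from the joint."""
    return joint[(1, z, y)] / (joint[(0, z, y)] + joint[(1, z, y)])

# Conditional on Y = 1, learning that Z = 1 makes X = 1 less likely:
print(round(p_x1_given(1, 0), 3), round(p_x1_given(1, 1), 3))  # 0.889 0.543
```
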
The identifiability results discussed so far all assume that the correct causal graph is a DAG. However, it is common that latent variables will be present, and even more common that we might wish to allow for the possibility of latent variables (whether they are actually there or not). If we allow that the correct causal graph may contain double-headed arrows, we can still apply MCd-separation, and ask which graphs imply the same sets of conditional independence relations. The Markov equivalence class will be larger than it was when we did not allow for latent variables. For instance, given the last set of probability relations described above, the graph
\[X \rightarrow Y \leftarrow Z\]
is no longer the only one compatible with this distribution. The structure
\[X \leftrightarrow Y \leftrightarrow Z\]
is also possible, as are several others.
A conditional probability such as \(\PP(Y = y \mid X = x)\) gives us the probability that Y will take the value y, given that X has been observed to take the value x. Often, however, we are interested in predicting the value of Y that will result if we intervene to set the value of X equal to some particular value x. Pearl writes \(\PP(Y = y \mid \do(X = x))\) to characterize this probability. What is the difference between observation and intervention? When we merely observe the value that a variable takes, we are learning about the value of the variable when it is caused in the normal way, as represented in our causal model. Information about the value of the variable will also provide us with information about its causes, and about other effects of those causes. However, when we intervene, we override the normal causal structure, forcing a variable to take a value it might not have taken if the system were left alone. The value of the variable is determined completely by our intervention, the causal influence of the other variables being completely overridden. Graphically, we can represent the effect of this intervention by eliminating the arrows directed into the variables intervened upon. Such an intervention is sometimes described as ‘breaking’ those arrows.
A causal model can be used to predict the effects of such an intervention. Suppose we have a causal model in which the probability distribution P satisfies MC on the causal DAG \(\bG\) over the variable set \(\bV = \{X_1, X_2 ,\ldots ,X_n\}\). The most useful version of MC for thinking about interventions is MCFactorization (see Section 3.3), which tells us:
\[\PP(X_1, X_2 , \ldots ,X_n) = \prod_i \PP(X_i \mid \PA(X_i))\]
Now suppose that we intervene by setting the value of \(X_k\) to \(x_k\). The post-intervention probability \(\PP'\) is the result of altering the factorization as follows:
\[\PP'(X_1, X_2 , \ldots ,X_n) = \PP'(X_k) \times \prod_{i\ne k} \PP(X_i \mid \PA(X_i)),\]
where \(\PP'(X_k = x_k) = 1\). The conditional probabilities of the form \(\PP(X_i \mid \PA(X_i))\) for \(i \ne k\) remain unchanged by the intervention.
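The truncated factorization can be put to work in a few lines. The sketch below (all numbers invented) uses a model in which Z confounds X and Y (\(Z \rightarrow X\), \(Z \rightarrow Y\), \(X \rightarrow Y\)), and shows that the interventional probability \(\PP(Y = 1 \mid \do(X = 1))\) differs from the observational \(\PP(Y = 1 \mid X = 1)\):

```python
# Truncated factorization for Z -> X, Z -> Y, X -> Y: intervening on X
# replaces the factor P(X|Z) with a point mass, so Z is marginalized with
# its unconditional weights. Invented probabilities throughout.
P_Z = {0: 0.5, 1: 0.5}
P_X1_given_Z = {0: 0.2, 1: 0.8}
P_Y1_given_XZ = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # keys (x, z)

def p_y1_do_x(x):
    """P'(Y=1 | do(X=x)) = sum_z P(z) P(Y=1 | x, z)."""
    return sum(P_Z[z] * P_Y1_given_XZ[(x, z)] for z in (0, 1))

def p_y1_given_x(x):
    """Observational P(Y=1 | X=x), which also carries information about Z."""
    px = lambda z: P_X1_given_Z[z] if x == 1 else 1 - P_X1_given_Z[z]
    num = sum(P_Z[z] * px(z) * P_Y1_given_XZ[(x, z)] for z in (0, 1))
    den = sum(P_Z[z] * px(z) for z in (0, 1))
    return num / den

print(round(p_y1_do_x(1), 3), round(p_y1_given_x(1), 3))  # 0.65 0.8
```

Observing \(X = 1\) is evidence that \(Z = 1\), which independently promotes Y; the intervention breaks that back-door route, so the two probabilities come apart.
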
This treatment of interventions has been expanded in a number of directions. The ‘manipulation theorem’ (theorem 3.6 of SGS 2000) generalizes the formula to cover a much broader class of interventions, including ones that don’t break all the arrows into the variables that are intervened on. Pearl (2009: Chapter 3) develops an axiomatic system he calls the ‘do-calculus’ for computing post-intervention probabilities that can be applied to systems with latent variables.
Causal modeling is a burgeoning area of research. This entry has largely ignored work on computational methods, as well as applications of the tools discussed here. Rather, the focus has been on the conceptual underpinnings of recent programs in causal modeling, with special attention to the connection between causation and probability. It has also focused on what it is possible to learn about causation “in principle” on the basis of probabilities, while ignoring the practical problems of making causal inferences on the basis of finite data samples (which inevitably deviate from the true probabilities).
The entry on Causal Models covers all of the material in this section in greater detail. The most important works surveyed in this section are Pearl 2009 and Spirtes, Glymour, & Scheines 2000. Pearl 2010 is a short overview of Pearl’s program, and Pearl et al. 2016 is a longer overview. The latter, in particular, assumes relatively little technical background. Scheines 1997 and the Introduction of Glymour & Cooper 1999 are accessible introductions to the SGS program. Neapolitan 2004 is a textbook that treats Bayes nets in causal and noncausal contexts. Neapolitan & Jiang 2016 is a short overview of this topic. Hausman 1999, Glymour 2009, Hitchcock 2009, and Eberhardt 2017 are short overviews that cover some of the topics raised in this section. The entry on causation and manipulability contains extensive discussion of interventions, and some discussion of causal models.
Many philosophers and legal theorists have been interested in the relation of actual causation. This concerns the assignment of causal responsibility for an event, based on how events actually play out. For example, suppose that Billy and Suzy each throw a rock at a bottle, and that each has a certain probability of hitting and breaking it. As it happens, Suzy’s rock hits the bottle, and Billy’s doesn’t. As things actually happened, we would say that Suzy’s throw caused the bottle to shatter, while Billy’s didn’t. Nonetheless, Billy’s throw increased the probability that the bottle would shatter, and it would be identified as a cause by the theories described in sections 2 and 3. Billy’s throw had a tendency to shatter the bottle; it was a potential cause of the bottle shattering; it was the sort of thing that generally causes shattering; but it did not actually cause the bottle to shatter.
A number of authors have attempted to provide probabilistic analyses of actual causation. Some, such as Eells (1991: chapter 6), Kvart (1997, 2004), and Glynn (2011), pay careful attention to the way in which probabilities change over time. Some, such as Dowe (2004) and Schaffer (2001), combine probabilities with the resources of a process theory of causation. Some, such as Lewis (1986b), Menzies (1989), and Noordhof (1999), employ probabilities together with counterfactuals to analyze actual causation. And others, such as Beckers & Vennekens (2016), Fenton-Glynn (2017), Halpern (2016: Section 2.5), Hitchcock (2004a), and Twardy & Korb (2011), employ causal modeling tools similar to those described in Section 3. We will describe two of those theories—Lewis (1986b) and Fenton-Glynn (2017)—in more detail in sections 4.3 and 4.4 below.
In Section 2.5 above, we saw that Eells (1991) defines a variety of different ways in which C can be causally relevant for E. C can be a positive, negative, or mixed cause of E depending upon whether C raises, lowers, or leaves unchanged the probability of E in various background conditions \(B_i\). A natural suggestion is that (i) an actual cause of E is a type of positive cause of E; but (ii) for assessing actual causation, only the background condition that actually obtains is relevant. Putting these ideas together, we get:
As we shall see in the next section, this type of analysis is vulnerable to two types of counterexamples: cases where causes seem to lower (or leave unchanged) the probabilities of their effects; and cases where non-causes seem to raise the probabilities of events that are not their effects. Most of the theories mentioned in the previous section can be seen as attempts to improve upon AC1 to deal with these types of counterexample.
Actual causes can sometimes lower the probability of their effects in cases of preemption: Suppose that Billy and Suzy are aiming rocks at a bottle. Billy decides that he will give Suzy the opportunity to throw first; he will throw his rock just in case Suzy doesn’t throw hers. For mathematical convenience, we will assume that there is some small probability—0.1, say—that Billy does not faithfully execute his plan. Billy is a more accurate thrower than Suzy. If Billy throws his rock, there is a 90% chance that it will shatter the bottle; if Suzy throws, she has a 50% chance of success. Suzy throws her rock and Billy doesn’t; Suzy’s rock hits the bottle and smashes it. By throwing, Suzy lowered the probability of shattering from 81% (the probability that Billy would both throw and hit if Suzy hadn’t thrown) to 54.5% (accommodating the small probability that Billy will throw even if Suzy throws). Suzy’s throw preempts Billy’s throw: she prevents Billy from throwing, and substitutes her own, less reliable throw. Nonetheless, Suzy’s throw actually caused the bottle to shatter.
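The probabilities in this example can be checked directly (a small sketch of the arithmetic, using the numbers given in the text):

```python
# Preemption arithmetic: Billy throws with probability 0.9 if Suzy doesn't
# (and 0.1 if she does, failing to execute his plan), and hits with
# probability 0.9; Suzy hits with probability 0.5.
p_billy_hits = 0.9

# If Suzy doesn't throw: Billy throws (0.9) and hits (0.9).
p_shatter_no_suzy = 0.9 * p_billy_hits

# If Suzy throws: the bottle survives only if she misses (0.5) and Billy's
# unplanned throw (0.1) fails to hit.
p_shatter_suzy = 1 - (1 - 0.5) * (1 - 0.1 * p_billy_hits)

print(round(p_shatter_no_suzy, 3))  # 0.81
print(round(p_shatter_suzy, 3))     # 0.545
```

So Suzy's throw lowers the probability of shattering from 81% to 54.5%, exactly as the text reports.
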
Changing the example slightly gives us a case of a probability-raising non-cause. Suppose that Billy and Suzy throw their rocks simultaneously. As it happens, Suzy’s throw hits the bottle and Billy’s misses. Nonetheless, Billy’s throw increased the probability that the bottle would shatter from 50% (the probability that Suzy would hit) to 95% (the probability that at least one of them would hit). But Billy’s throw did not in fact cause the bottle to shatter. In the terminology of Schaffer (2001), Billy’s throw is a fizzler. It had the potential to shatter the bottle, but it fizzled out, and something else actually caused the bottle to break.
David Lewis is the best-known advocate of a counterfactual theory of causation. In Lewis 1973, he offered a counterfactual theory of causation under the assumption of determinism. Lewis 1986b presented a probabilistic extension to this counterfactual theory of causation.
Lewis defines a relation of causal dependence that is sufficient, but not necessary, for causation.
The counterfactual in (iii) is to be understood in terms of possible worlds: it says that in the nearest possible world(s) where C does not occur, the probability of E is less than or equal to y. (There needn’t be a single value that the probability would have been. It can take on different values in the closest possible worlds, as long as all of those values are less than or equal to y.) On this account, the relevant notion of ‘probability-raising’ is not understood in terms of conditional probabilities, but in terms of unconditional probabilities in different possible worlds.
Lewis defines causation (what we are calling “actual causation”) to be the ancestral of causal dependence; that is:
This definition guarantees that causation will be transitive: if C causes D, and D causes E, then C causes E. This modification is useful for addressing certain types of preemption. Consider the example from the previous section, where Suzy throws her rock, preempting Billy. We can interpolate an event D between Suzy’s throw, C, and the bottle’s shattering, E. Let D be the presence of Suzy’s rock on its actual trajectory, at some time after Billy has already failed to throw. If Suzy hadn’t thrown, D would have been much less likely. And if D hadn’t occurred, E would have been much less probable. Since D occurs after Billy has already declined to throw, if D hadn’t occurred, there would not have been any rock on a trajectory toward the bottle. Thus there is a chain of causal dependence from C to D to E.
Despite this success, it has been widely acknowledged (even by Lewishimself) that Lewis’s probabilistic theory has problems withother types of preemption, and with probability-raisingnon-causes.
Fenton-Glynn (2017) offers an analysis of actual causation that isbased on the definition of Halpern and Pearl (2005), who consider onlythe deterministic case. What follows here is a simplified version ofFenton-Glynn’s proposal, as one example of an analysis employingcausal models.
Let \(\bV\) be a set of time-indexed, binary variables, which weassume to include any common causes of variables in \(\bV\) (so thatthe correct causal graph on \(\bV\) is a DAG). Let \(*\) be an assignmentfunction that assigns to each variable \(X\) in \(\bV\) one of itspossible values. Intuitively, \(*\) identifies theactual valueof each variable. We will denote \(*(X)\) by \(x^*\), and \(x'\)will denote the non-actual value of \(X\). If \(\bX\) is a set ofvariables in \(\bV\), \(\bX\) = \(\bx^*\) will be a proposition stating thateach variable in \(\bX\) takes the actual value assigned by \(*\). Let Pbe a probability function on \(\bV\) representing objectiveprobability, which we assume to satisfy the Markov and MinimalityConditions (Sections3.3 and3.4 above). We also assume that P assigns positive probability to everypossible assignment of values to variables in \(\bV\).
Given the identifiability result described in Section 3.5 above, we can recover the correct causal graph \(\bG\) from the probability function P together with the time-indices of the variables. We can now use P and \(\bG\) to compute the effects of interventions, as described in Section 3.6 above. We now define actual causation as follows:
Intuitively, this is what is going on: if \(X = x^*\) is an actual cause of \(Y = y^*\), then there has to be at least one directed path from \(X\) to \(Y\). \(\bZ\) will consist of variables that lie along some (but not necessarily all) of these paths. (If \(X\) is a direct cause of \(Y\), then \(\bZ\) can be empty.) F-G requires that \(X = x^*\) raises the probability of \(Y = y^*\) in the sense that interventions that set \(X\) to \(x^*\) result in higher probabilities for \(Y = y^*\) than interventions that set \(X\) to \(x'\). Specifically, \(X = x^*\) must raise the probability of \(Y = y^*\) when we also intervene to set the variables in \(\bW\) to their actual values. \(\bW = \bw^*\) is like a background context of the sort discussed in Section 2.4, except that \(\bW\) may include some variables that are descendants of \(X\). Moreover, \(X = x^*\) must raise the probability of \(Y = y^*\) in conjunction with any combination of variables in \(\bZ\) being set to their actual values. The idea is that the probabilistic impact of \(X\) on \(Y\) is constrained to flow through the variables in \(\bZ\), and at every stage in the process, the value of the variables in \(\{X\} \cup \bZ\) must confer a higher probability on \(Y = y^*\) than the baseline probability that would have resulted if \(X\) had been set to \(x'\).
Let’s see how this account handles the problem cases from Section 4.2. For the example of preemption, we will use the following variables:

\(\ST_0\) = 1 if Suzy throws her rock, 0 otherwise;
\(\BT_1\) = 1 if Billy throws his rock, 0 otherwise;
\(\BS_2\) = 1 if the bottle shatters, 0 otherwise.
The subscripts indicate the relative times of the events, with larger numbers corresponding to later times. The actual values of the variables are \(\ST_0 = 1\), \(\BT_1 = 0\), and \(\BS_2 = 1\). The probabilities are:
\[\begin{align}
\PP(\BT_1 = 1 \mid \ST_0 = 1) &{} = .1 \\
\PP(\BT_1 = 1 \mid \ST_0 = 0) &{} = .9 \\[1ex]
\PP(\BS_2 = 1 \mid \ST_0 = 1 \amp \BT_1 = 1) &{} = .95\\
\PP(\BS_2 = 1 \mid \ST_0 = 1 \amp \BT_1 = 0) &{} = .5\\
\PP(\BS_2 = 1 \mid \ST_0 = 0 \amp \BT_1 = 1) &{} = .9\\
\PP(\BS_2 = 1 \mid \ST_0 = 0 \amp \BT_1 = 0) &{} = .01
\end{align}\]
(Note that we have added a small probability for the bottle to shatter due to some other cause, even if neither Suzy nor Billy throws a rock. This ensures that the probabilities of all assignments of values to the variables are positive.) The corresponding graph is shown in Figure 9.
Figure 9
Applying F-G, we can take \(\bW = \{\BT_1\}\) and \(\bZ = \varnothing\). We have:
\[\begin{align}
\PP(\BS_2 = 1 \mid \do(\ST_0 = 1) \amp \do(\BT_1 = 0)) &{} = .5\\
\PP(\BS_2 = 1 \mid \do(\ST_0 = 0) \amp \do(\BT_1 = 0)) &{} = .01
\end{align}\]
Holding fixed that Billy doesn’t throw, Suzy’s throw raises the probability that the bottle will shatter. Thus the conditions are met for \(\ST_0 = 1\) to be an actual cause of \(\BS_2 = 1\).
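Because both parents of \(\BS_2\) are set by intervention here, the post-intervention probabilities coincide with the corresponding entries of the conditional probability table for \(\BS_2\). This can be checked with a few lines of Python (the table and function names are ours):

```python
# Conditional probability table for BS_2 in the preemption example (Figure 9):
# keys are (ST_0, BT_1) value pairs, entries are P(BS_2 = 1 | ST_0, BT_1).
P_BS = {(1, 1): 0.95, (1, 0): 0.5, (0, 1): 0.9, (0, 0): 0.01}

def p_shatter_do(st, bt):
    """P(BS_2 = 1 | do(ST_0 = st), do(BT_1 = bt)).

    Since both parents of BS_2 are fixed by intervention, the interventional
    probability is just the matching conditional probability table entry.
    """
    return P_BS[(st, bt)]

# Holding fixed do(BT_1 = 0), Suzy's throw raises the shattering probability:
print(p_shatter_do(1, 0))  # 0.5
print(p_shatter_do(0, 0))  # 0.01
```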
To treat the case of fizzling from Section 4.2, let

\(\ST_0\) = 1 if Suzy throws her rock, 0 otherwise;
\(\BT_0\) = 1 if Billy throws his rock, 0 otherwise;
\(\SH_1\) = 1 if Suzy’s rock hits the bottle, 0 otherwise;
\(\BH_1\) = 1 if Billy’s rock hits the bottle, 0 otherwise;
\(\BS_2\) = 1 if the bottle shatters, 0 otherwise.
The actual values are \(\ST_0 = 1\), \(\BT_0 = 1\), \(\SH_1 = 1\), \(\BH_1 = 0\), and \(\BS_2 = 1\). The probabilities are:
\[\begin{align}
\PP(\SH_1 = 1 \mid \ST_0 = 1) &{} = .5\\
\PP(\SH_1 = 1 \mid \ST_0 = 0) &{} = .01\\[2ex]
\PP(\BH_1 = 1 \mid \BT_0 = 1) &{} = .9\\
\PP(\BH_1 = 1 \mid \BT_0 = 0) &{} = .01\\[2ex]
\PP(\BS_2 = 1 \mid \SH_1 = 1 \amp \BH_1 = 1) & {} = .998 \\
\PP(\BS_2 = 1 \mid \SH_1 = 1 \amp \BH_1 = 0) & {} = .95\\
\PP(\BS_2 = 1 \mid \SH_1 = 0 \amp \BH_1 = 1) & {} = .95 \\
\PP(\BS_2 = 1 \mid \SH_1 = 0 \amp \BH_1 = 0) & {} = .01
\end{align}\]
As before, we have assigned probabilities close to, but not equal to, zero and one for some of the possibilities. The graph is shown in Figure 10.
Figure 10
We want to show that \(\BT_0 = 1\) is not an actual cause of \(\BS_2 = 1\) according to F-G. We will show this by means of a dilemma: is \(\BH_1 \in \bW\) or is \(\BH_1 \in \bZ\)?
Suppose first that \(\BH_1 \in \bW\). Then, regardless of whether \(\ST_0\) and \(\SH_1\) are in \(\bW\) or \(\bZ\), we will need to have
\[\begin{align}
\PP(\BS_2 = 1 &\mid \do(\BT_0 = 1, \BH_1 = 0, \ST_0 = 1, \SH_1 = 1))\\
\mathbin{\gt} &\PP(\BS_2 = 1 \mid \do(\BT_0 = 0, \BH_1 = 0, \ST_0 = 1, \SH_1 = 1))
\end{align}\]
But in fact both of these probabilities are equal to .95. If we intervene to set \(\BH_1\) to 0, intervening on \(\BT_0\) makes no difference to the probability of \(\BS_2 = 1\).
So let us suppose instead that \(\BH_1 \in \bZ\). Then we will need to have
\[\begin{align}
\PP(\BS_2 = 1 &\mid \do(\BT_0 = 1, \BH_1 = 0, \ST_0 = 1, \SH_1 = 1))\\
\mathbin{\gt} &\PP(\BS_2 = 1 \mid \do(\BT_0 = 0, \ST_0 = 1, \SH_1 = 1))
\end{align}\]
This inequality is slightly different, since \(\BH_1 = 0\) does not appear in the second probability. Nonetheless we have
\[\PP(\BS_2 = 1 \mid \do(\BT_0 = 1, \BH_1 = 0, \ST_0 = 1, \SH_1 = 1)) = .95\]
and
\[\PP(\BS_2 = 1 \mid \do(\BT_0 = 0, \ST_0 = 1, \SH_1 = 1)) = .9505\]
(The second probability is a tiny bit larger, due to the very small probability that Billy’s rock will hit even if he doesn’t throw it.)
So regardless of whether \(\BH_1 \in \bW\) or \(\BH_1 \in \bZ\), condition F-G is not satisfied, and \(\BT_0 = 1\) is not judged to be an actual cause of \(\BS_2 = 1\). The key idea is that it is not enough for Billy’s throw to raise the probability of the bottle shattering; Billy’s throw together with what happens afterwards has to raise the probability of shattering. As things actually happened, Billy’s rock missed the bottle. Billy’s throw together with his rock missing does not raise the probability of shattering.
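Both horns of the dilemma can be checked numerically from the probability tables above. In the first horn \(\BH_1\) is set by intervention along with the other variables; in the second, \(\BH_1\) inherits its distribution from the intervention on \(\BT_0\) and is summed out. A sketch in Python (the table and function names are ours):

```python
# Tables from the fizzling example (Figure 10):
P_BH = {1: 0.9, 0: 0.01}                        # P(BH_1 = 1 | BT_0)
P_BS = {(1, 1): 0.998, (1, 0): 0.95,
        (0, 1): 0.95, (0, 0): 0.01}             # P(BS_2 = 1 | SH_1, BH_1)

def p_shatter(bt, sh, bh=None):
    """P(BS_2 = 1) under do(BT_0 = bt, SH_1 = sh), optionally also do(BH_1 = bh).

    Every parent of BS_2 is either intervened on directly or (in BH_1's case)
    has all of its own parents intervened on, so the interventional
    probability reduces to a sum over the conditional probability tables.
    """
    if bh is not None:                          # BH_1 is also set by intervention
        return P_BS[(sh, bh)]
    p_bh1 = P_BH[bt]                            # BH_1 inherited from do(BT_0 = bt)
    return p_bh1 * P_BS[(sh, 1)] + (1 - p_bh1) * P_BS[(sh, 0)]

# Horn 1: BH_1 in W, so both sides intervene with do(BH_1 = 0):
print(p_shatter(bt=1, sh=1, bh=0))   # 0.95
print(p_shatter(bt=0, sh=1, bh=0))   # 0.95  -- no probability raising
# Horn 2: BH_1 in Z, so the baseline sums BH_1 out:
print(p_shatter(bt=0, sh=1))         # ~0.9505 -- again no raising
```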
Note that this treatment of fizzling requires that we include variables for whether the rocks hit the bottle. If we try to model this case using just three variables, \(\BT\), \(\ST\), and \(\BS\), we will incorrectly judge that Billy’s throw is a cause of the bottle shattering. This raises the question of what is the “right” model to use, and whether we can know if we have included “enough” variables in our model. Fenton-Glynn (2017) includes some discussion of these tricky issues.
While this section describes some success stories, it is safe to say that no analysis of actual causation is widely believed to perfectly capture all of our pre-theoretic intuitions about hypothetical cases. Indeed, it is not clear that these intuitions form a coherent set, or that they are perfectly tracking objective features of the world. Glymour et al. (2010) raise a number of challenges to the general project of trying to provide an analysis of actual causation.
The anthologies Collins et al. 2004 and Dowe & Noordhof 2004 contain a number of essays on topics related to the discussion of this section. Hitchcock 2004b has an extended discussion of the problem posed by fizzlers. Hitchcock 2015 is an overview of Lewis’s work on causation. The entry for counterfactual theories of causation discusses Lewis’s work, and counterfactual theories of causation more generally.
Thanks to Frederick Eberhardt, Luke Fenton-Glynn, Clark Glymour, Judea Pearl, Richard Scheines, Elliott Sober, Jim Woodward, and the editors of the Stanford Encyclopedia of Philosophy for detailed comments, corrections, and discussion.