Stanford Encyclopedia of Philosophy

Simpson’s Paradox

First published Wed Mar 24, 2021

Simpson’s Paradox is a statistical phenomenon where an association between two variables in a population emerges, disappears or reverses when the population is divided into subpopulations. For instance, two variables may be positively associated in a population, but be independent or even negatively associated in all subpopulations. Cases exhibiting the paradox are unproblematic from the perspective of mathematics and probability theory, but nevertheless strike many people as surprising. Additionally, the paradox has implications for a range of areas that rely on probabilities, including decision theory, causal inference, and evolutionary biology. Finally, there are many instances of the paradox, including in epidemiology and in studies of discrimination, where understanding the paradox is essential for drawing the correct conclusions from the data.

The following article provides a mathematical analysis of the paradox, explains its role in causal reasoning and inference, compares theories of what makes the paradox seem paradoxical, and surveys its applications in different domains.


1. Introduction

We begin with an illustration of the paradox with concrete data. The numbers in Table 1 summarize the effect of a medical treatment for the overall population (N = 52), and separately for men and women:

                 Full Population (N = 52)       Men (M), N = 20            Women (¬M), N = 32
                 Success  Failure  Rate         Success  Failure  Rate     Success  Failure  Rate
Treatment (T)    20       20       50%          8        5        ≈ 61%    12       15       ≈ 44%
Control (¬T)     6        6        50%          4        3        ≈ 57%    2        3        ≈ 40%

Table 1: Simpson's Paradox: the type of association at the population level (positive, negative, independent) changes at the level of subpopulations. Numbers taken from Simpson’s original example (1951).

For matters of exposition, we assume that these frequencies are unbiased estimates of the underlying probabilities. The treatment looks ineffective at the level of the overall population, but it leads to higher success percentages than the control both for men and for women (61% vs. 57% for men and 44% vs. 40% for women). Writing these proportions as conditional probabilities, with \(\r{T}\) = treatment, \(\r{S}\) = success/recovery, and \(\r{M}\) = male subpopulation, we obtain

\[ p(\r{S}\mid \r{T}) = p(\r{S}\mid \neg \r{T}) \]

but at the same time,

\[\begin{align*} p(\r{S}\mid \r{T}, \r{M}) & \gt p(\r{S}\mid \neg \r{T}, \r{M} ) \\ p(\r{S}\mid \r{T}, \neg \r{M}) &\gt p(\r{S}\mid \neg \r{T}, \neg \r{M}) \end{align*}\]

Should we use the treatment or not? When we know the gender of the patient, we would presumably administer the treatment, whereas it does not look like the right thing to do when we don’t know the patient’s gender—although we know that the patient is either male or female!
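These conditional probabilities can be checked directly from the cell counts in Table 1. A minimal sketch in Python (the data layout and names are ours, not from the article):

```python
# Cell counts from Table 1 (Simpson 1951): (successes, failures)
# for the treatment and control rows, overall and per subpopulation.
tables = {
    "overall": {"T": (20, 20), "no-T": (6, 6)},
    "men":     {"T": (8, 5),   "no-T": (4, 3)},
    "women":   {"T": (12, 15), "no-T": (2, 3)},
}

def rate(successes, failures):
    """Estimated success probability p(S | row)."""
    return successes / (successes + failures)

for pop, rows in tables.items():
    print(pop, round(rate(*rows["T"]), 3), round(rate(*rows["no-T"]), 3))
```

Running it shows 50% vs. 50% overall, but roughly 61.5% vs. 57.1% for men and 44.4% vs. 40.0% for women, reproducing the reversal described above.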

This phenomenon was first pointed out in papers by Karl G. Pearson (1899) and George U. Yule (1903), but it was Simpson’s short paper “The interpretation of interaction in contingency tables” (1951), discussing the interpretation of such association reversals, that led to the phenomenon being labeled as “Simpson’s Paradox”. The phenomenon is, however, broader than independence in the overall population and positive association in the subpopulations; for example, the associations may also be reversed. Nagel and Cohen (1934: ch. 16) provide an example of such a reversal as part of an exercise for logic students.

Understanding the paradox is essential for drawing the proper conclusions from statistical data. To give a recent example involving the paradox (Kügelgen, Gresele, & Schölkopf [see Other Internet Resources]), early data revealed that the case fatality rate for Covid-19 was higher in Italy than in China overall. Yet within every age group the fatality rate was higher in China than in Italy. One thus appears to get opposite conclusions about the comparative severity of the virus in the countries depending on whether one compares the whole populations or the age-partitioned populations. Having a proper analysis of what is going on in such cases is thus crucial for using statistics to inform policy.

In what follows, Section 2 explains different varieties of the paradox, clarifies the logical relationships between them, and identifies precise conditions for when the paradox can occur. While that section focuses on the mathematical characterization of the paradox, Section 3 focuses on its role in causal inference, its implications for probabilistic theories of causality, and its analysis by means of causal models based on directed acyclic graphs (DAGs: Spirtes, Glymour, & Scheines 2000; Pearl 2000 [2009]).

Based on these different approaches, Section 4 discusses different analyses of what makes Simpson’s Paradox look paradoxical, and what kind of error it reveals in human reasoning. This section also reports empirical findings on the prevalence of the paradox in reasoning and inference. Section 5 surveys the occurrence and interpretation of the paradox in applied statistics (regression models), philosophy of biology, decision theory and public policy. For example, Simpson’s Paradox is relevant when analyzing data to test for race or gender discrimination (Bickel, Hammel, & O’Connell 1975). Section 6 wraps up our findings and concludes.

2. Definition and Mathematical Characterization

This section shows how Simpson’s Paradox can be characterized mathematically, under which conditions it occurs, and how it can be avoided. We begin by further considering the concrete example from the introduction in order to build intuitions that will guide us through the more technical results.

The data in Table 1 can be translated into success or recovery rates, showing that treated men have a higher recovery rate than untreated men (roughly 61% vs. 57%), and the same for women (44% vs. 40%). Two observations are key to understanding why this positive association vanishes in the aggregate data. First, the recovery rate of untreated men is still higher than the recovery rate of women who receive treatment (57% vs. 44%), suggesting that not only treatment, but also gender is a relevant predictor of recovery. Second, while the treatment group is majority female (27 vs. 13), the control group is majority male (7 vs. 5). Speaking informally, the lack of population-level correlation between treatment and recovery results from men being both (i) more likely to recover from the treatment, and (ii) less likely to be in the treatment group.

This becomes evident when we use conditional probabilities to represent recovery rates given treatment and/or subpopulation. The overall recovery rates given treatment and control can, by the Law of Total Probability, be written as the weighted average of recovery rates in the subpopulations:

\[\begin{align*}p(\r{S}\mid \r{T}) &= p(\r{S}\mid \r{T},\r{M}) p(\r{M}\mid \r{T}) + p(\r{S}\mid \r{T}, \neg \r{M}) p(\neg \r{M}\mid \r{T}) \\ p(\r{S}\mid \neg \r{T}) &= p(\r{S}\mid \neg \r{T},\r{M}) p(\r{M}\mid \neg \r{T}) + p(\r{S}\mid \neg \r{T}, \neg \r{M}) p(\neg \r{M}\mid \neg \r{T})\end{align*}\]

Plugging in the numbers from Table 1 to calculate the overall recovery rates via these equations, we see that the first line is a weighted average of success rates for treated men and women (61% and 44%) while the second line is a weighted average of success rates of the two control groups (57% and 40%). These averages are weighted by the percentage of males and females in each group, and in the present case the gender disparity between the groups results in both averages being 50%. Since these weights can be different, the treatment may raise the probability of success among males and females without doing so in the combined population.
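The weighted-average decomposition can likewise be verified numerically. The following sketch (our own illustration, with hypothetical helper names) recomputes the overall recovery rates from the subpopulation rates and mixing weights:

```python
# (success, failure) counts from Table 1: men/women under treatment and control.
men_t, women_t = (8, 5), (12, 15)
men_c, women_c = (4, 3), (2, 3)

def total_probability(sub1, sub2):
    """Overall success rate as a weighted average of subpopulation rates
    (Law of Total Probability)."""
    n1, n2 = sum(sub1), sum(sub2)              # subgroup sizes within the condition
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)    # weights p(M | cond.), p(not-M | cond.)
    r1, r2 = sub1[0] / n1, sub2[0] / n2        # subpopulation success rates
    return w1 * r1 + w2 * r2

print(total_probability(men_t, women_t))   # overall p(S | T)
print(total_probability(men_c, women_c))   # overall p(S | not-T)
```

Both lines evaluate to 0.5 (up to floating-point rounding), matching the 50% success rates in the aggregate data.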

Later we will show that the positive association in the subpopulations cannot vanish if the correlation of treatment with gender is broken (e.g., by balancing gender rates in both conditions). The weights in each line are then identical—\(p(\r{M}\mid \r{T}) = p(\r{M}\mid \neg \r{T})\)—and associations in subpopulations are preserved for the aggregate data (Theorem 1 in Section 2.2). In fact, the absence of such a correlation rules out Simpson’s Paradox.

2.1 Varieties of Simpson’s Paradox

Simpson’s Paradox can occur for various types of data, but classically, it is formulated with respect to \(2\times2\) contingency tables. Let \(D_i = (a_i, b_i, c_i, d_i)\) be a four-dimensional vector of real numbers representing the \(2\times2\) contingency table for treatment and success in the i-th subpopulation, and let

\[D = \sum_{i=1}^N D_i = \left(\sum a_i, \sum b_i, \sum c_i, \sum d_i\right)\]

be the aggregate data set over \(N\) subpopulations. These data should be read as shown in Table 2.

                    Population \(D = D_1 + D_2\)      Subpopulation \(D_1\)     Subpopulation \(D_2\)
                    Success        Failure            Success     Failure       Success     Failure
Treatment (T)       \(a_1+a_2\)    \(b_1+b_2\)        \(a_1\)     \(b_1\)       \(a_2\)     \(b_2\)
No Treatment (¬T)   \(c_1+c_2\)    \(d_1+d_2\)        \(c_1\)     \(d_1\)       \(c_2\)     \(d_2\)

Table 2: Abstract representation of a \(2 \times 2\) contingency table with subpopulations \(D_1\) and \(D_2\).

Let \(\alpha (D_i)\) be a measure of the strength of the probabilistic association between \(T\) and \(S\) in population \(D_i\).[1] By convention, \(\alpha (D_i) = 0\) corresponds to no association between the variables, \(\alpha (D_i) \gt 0\) indicates a positive association, and \(\alpha (D_i) < 0\) a negative one. This can best be translated into the condition

\[\begin{align*}\tag{1}\alpha (D_i) & \begin{cases}> 0 & \qquad \text{if and only if} \qquad a_i \, d_i > b_i \, c_i; \\ = 0 & \qquad \text{if and only if} \qquad a_i \, d_i = b_i \, c_i; \\ < 0 & \qquad \text{if and only if} \qquad a_i \, d_i < b_i \, c_i.\end{cases}\end{align*}\]

The condition \(a_i \, d_i > b_i \, c_i\) is equivalent to saying that the success rate in the first row (“treatment condition”) is higher than the success rate in the second row (“control condition”):

\[ a_i/(a_i+b_i) > c_i/(c_i+d_i).\]
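This equivalence between the cross-product condition and the comparison of row success rates is easy to confirm computationally; here is a small check (our own sketch) over the three tables from Table 1:

```python
# Each tuple is (a, b, c, d) = (T-success, T-failure, C-success, C-failure).
tables = [(8, 5, 4, 3),     # men
          (12, 15, 2, 3),   # women
          (20, 20, 6, 6)]   # aggregate

def cross_sign(a, b, c, d):
    """Sign of a*d - b*c: +1, 0, or -1."""
    return (a * d > b * c) - (a * d < b * c)

def rate_sign(a, b, c, d):
    """Sign of a/(a+b) - c/(c+d)."""
    lhs, rhs = a / (a + b), c / (c + d)
    return (lhs > rhs) - (lhs < rhs)

for t in tables:
    assert cross_sign(*t) == rate_sign(*t)
    print(t, cross_sign(*t))   # +1 for both subpopulations, 0 for the aggregate
```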

Applying all this to our dataset in Table 1, we see that \(\alpha(D) = 0\) although \(\alpha(D_1) > 0\) and \(\alpha(D_2) > 0\). This is a special case of what Samuels (1993) calls Association Reversal (AR). Association reversal occurs if and only if there is a population such that the association in all partitioned subpopulations is either (i) positive, (ii) negative, or (iii) zero, and the type of association in the population does not match that of the subpopulations. Writing this out mathematically, this means for a dataset \(D = \sum_{i=1}^N D_i\) that one of the following two conditions holds,

\[\begin{align*} \alpha(D) &\le 0 \qquad \text{and} & \alpha(D_i) &\ge 0 \qquad \forall \; 1 \le i \le N \tag{AR1}\\\alpha(D) &\ge 0 \qquad \text{and} & \alpha(D_i) &\le 0 \qquad \forall \; 1 \le i \le N \tag{AR2}\end{align*}\]

where at least one of the inequalities has to be strict. Association reversal is the standard variety of Simpson’s Paradox (Bandyopadhyay et al. 2011; Blyth 1972, 1973) and also the one that is most frequently investigated in the psychology of reasoning, or by philosophers analyzing the paradox (e.g., Cartwright 1979; Eells 1991; Malinas 2001).

An important special case of AR occurs when there is no association in the subpopulations, but an association emerges in the overall dataset:

\[\begin{align*}\alpha(D_i) &= 0 \qquad \forall \; 1 \le i \le N \qquad \text{but} & \alpha(D) &\ne 0 \tag{YAP}\end{align*}\]

Referring to the pioneering work of the statistician George U. Yule (1903: 132–134), Mittal (1991) calls this Yule’s Association Paradox (YAP). It is typical of spurious correlations between variables with a common cause, that is, variables that are dependent unconditionally (\(\alpha(D) \ne 0\)) but independent given the values of the common cause (\(\alpha(D_i) = 0\)). For example, sleeping in one’s clothes is correlated with having a headache the next morning. However, once we stratify the data according to the levels of alcohol intake on the previous night, the association vanishes: given the same level of drunkenness, people who undress before going to bed will have the same headache, ceteris paribus, as those who keep their clothes on.

Finally, the most general version of Simpson’s Paradox is the Amalgamation Paradox (AMP) identified by Good and Mittal (1987). This paradox occurs when the overall degree of association is bigger (or smaller) than each degree of association in the subpopulations, or mathematically,

\[\begin{align*} \alpha(D) &> \max_{1 \le i \le N} \alpha(D_i) \qquad \text{or} & \alpha(D) &< \min_{1 \le i \le N} \alpha(D_i). \tag{AMP} \end{align*}\]

AMP challenges the intuition that the degree of association in the general population, in virtue of being “the sum” of the individual subpopulations, has to fall in between the minimal and the maximal degree of association observed on that level. The logical strength of the paradoxes is inversely related to their generality and frequency of occurrence: \(\text{YAP} \Rightarrow \text{AR} \Rightarrow \text{AMP}\). Variations of the paradox for non-categorical data (e.g., bivariate real-valued data) will be discussed in Section 5.1.
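Taking the difference in success rates as the measure \(\alpha\), the dataset from Table 1 can be classified programmatically. This sketch (our own, with hypothetical helper names) tests the AR and AMP conditions:

```python
def alpha(a, b, c, d):
    """Difference measure: success rate in row 1 minus success rate in row 2."""
    return a / (a + b) - c / (c + d)

subpops = [(8, 5, 4, 3), (12, 15, 2, 3)]   # men, women (Table 1)
D = tuple(map(sum, zip(*subpops)))          # aggregate table (20, 20, 6, 6)

sub = [alpha(*t) for t in subpops]
agg = alpha(*D)

# AR1: aggregate association <= 0, all subpopulation associations >= 0,
# with at least one strict inequality.
ar = agg <= 0 and all(x >= 0 for x in sub) and (agg < 0 or any(x > 0 for x in sub))
# AMP: the aggregate association falls outside [min, max] of the subpopulations.
amp = agg > max(sub) or agg < min(sub)
print(agg, sub, ar, amp)   # 0.0, two positive values, True, True
```

Both AR and AMP hold for this dataset, in line with the entailment \(\text{AR} \Rightarrow \text{AMP}\).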

2.2 Necessary and Sufficient Conditions

We proceed to characterizing the mathematical conditions under which Simpson’s Paradox occurs. We have already suggested that the paradox arises in the medical example due to correlations between the treatment variable and the partitioning variable, and we can now make this more precise:

Theorem 1 (Lindley & Novick 1981; Mittal 1991): If \(\alpha(D) > 0\) and association reversal occurs for the subpopulations characterized by attributes \(\r{M}\) and \(\neg\r{M}\) (i.e., \(\alpha(D_1), \alpha(D_2) \le 0\)), then either

  1. \(\r{M}\) is positively related to \(\r{S}\) and \(\r{T}\); or
  2. \(\r{M}\) is positively related to \(\neg\r{S}\) and \(\neg\r{T}\).

As Theorem 1 makes clear, the lack of correlation between \(\r{M}\) and \(\r{T}\) is sufficient to rule out association reversals (and thus YAP as well). Does it also rule out the more general amalgamation paradox? The answer to this depends on which measure of association one chooses for \(\alpha\). Discussions of Simpson’s Paradox commonly treat association as the difference in the success rate between the treated and the untreated, but this is only one of many possibilities (Fitelson 1999). While the lack of association between \(M\) and \(T\) is sufficient to rule out AMP for most measures (including the difference measure), it does not rule it out for all measures, as we will now explain. Readers not interested in the specific details may skip to the following section.

Here are some widely used association measures for a dataset \((a, b, c, d)\):

\[\begin{align*}\pi_{D} &= \frac{a}{a+b} - \frac{c}{c+d} & \pi_{Y} &= \frac{ad - bc}{N^2}\\ \pi_{R} &= \log \left(\frac{a}{a+b} \cdot \frac{c+d}{c} \right) & \pi_{W} &= \log \left(\frac{a}{a+c} \cdot \frac{b+d}{b} \right) \\ \pi_{O} &= \log \frac{ad}{bc} & \pi_{C} &= \log \left(\frac{d}{c+d} \cdot \frac{a+b}{a} \right) \end{align*}\]

Some of these measures can be formulated probabilistically and have been suggested as measures of causal strength and outcome measures for clinical trials (Edwards 1963; Eells 1991; Fitelson & Hitchcock 2011; Greenland 1987; Peirce 1884; Sprenger 2018; Sprenger & Stegenga 2017). For example, \(\pi_{D} = p(\r{S}\mid \r{T}) - p(\r{S}\mid \neg \r{T})\) represents the difference and \(\pi_R = p(\r{S}\mid \r{T}) / p(\r{S}\mid \neg \r{T})\) the ratio of success rates in treatment and control conditions. \(\pi_W\) can be interpreted as the prognostic weight of evidence that treatment provides for success (i.e., as the log-Bayes factor), \(\pi_{Y}\) is Yule’s (1903) measure of association, \(\pi_{O}\) is the log-odds ratio familiar from epidemiological data analysis, and \(\pi_C\) is I.J. Good’s (1960) measure of causal strength.
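The six measures can be transcribed directly from the formulas above; the following sketch (our own) evaluates them on the men's table from Table 1:

```python
from math import log

def measures(a, b, c, d):
    """The six association measures defined above for a 2x2 table (a, b, c, d)."""
    N = a + b + c + d
    return {
        "pi_D": a / (a + b) - c / (c + d),
        "pi_Y": (a * d - b * c) / N ** 2,
        "pi_R": log((a / (a + b)) * ((c + d) / c)),
        "pi_W": log((a / (a + c)) * ((b + d) / b)),
        "pi_O": log((a * d) / (b * c)),
        "pi_C": log((d / (c + d)) * ((a + b) / a)),
    }

for name, value in measures(8, 5, 4, 3).items():   # men's table from Table 1
    print(name, round(value, 4))
```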

We now consider the extent to which AMP for different measures is ruled out by different experimental designs. Suppose that individuals are uniformly assigned to the treatment and control condition across subpopulations. In such a case, where the ratio of persons assigned to the treatment and control condition is equal for each subpopulation, the experimental design is called row-uniform. Specifically, there has to be a \(\lambda > 0\) such that for any subpopulation i

\[ a_i + b_i = \lambda (c_i+d_i) \tag{Row Uniformity}\]

In particular, row uniformity holds approximately if our sample is large and we sample at random from the population.

Row-uniform design of a trial ensures independence between a potential confounder \(M\) and the treatment variable \(T\). Accordingly, by Theorem 1, it rules out association reversals. Additionally, row-uniform design is sufficient to rule out the AMP for a wide class of association measures:

Theorem 2 (Good & Mittal 1987): If a dataset \(D = \sum D_{i}\) satisfies row uniformity, then the Amalgamation Paradox is avoided for the measures \(\pi_{D}\), \(\pi_{R}\), \(\pi_{Y}\), \(\pi_{W}\), and \(\pi_{C}\). It is not avoided for the log-odds ratio \(\pi_{O}\).
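The exception for the log-odds ratio can be illustrated with a small constructed dataset (our own example, not taken from Good & Mittal): each subpopulation has ten treated and ten control patients, so row uniformity holds with \(\lambda = 1\), yet the aggregate log-odds ratio falls below both subpopulation values while the difference measure stays within their range:

```python
from math import log

def pi_D(a, b, c, d):
    """Difference in success rates between the two rows."""
    return a / (a + b) - c / (c + d)

def pi_O(a, b, c, d):
    """Log-odds ratio."""
    return log((a * d) / (b * c))

# Row-uniform by construction: a_i + b_i = c_i + d_i = 10 in each subpopulation.
sub1 = (9, 1, 5, 5)
sub2 = (5, 5, 1, 9)
agg = tuple(x + y for x, y in zip(sub1, sub2))   # (14, 6, 6, 14)

print(pi_D(*sub1), pi_D(*sub2), pi_D(*agg))   # 0.4 in all three cases: no AMP
print(pi_O(*sub1), pi_O(*sub2), pi_O(*agg))   # log 9, log 9, log(49/9): AMP
```

The aggregate log-odds ratio (about 1.70) lies below both subpopulation values (about 2.20), an instance of AMP despite row uniformity.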

Some studies also exhibit column-uniform design where the proportion of successes and failures is constant across all subpopulations:

\[ a_i + c_i = \lambda (b_i+d_i) \tag{Column Uniformity}\]

Then \(\r{M}\) is also independent of \(\r{S}\). Column uniformity can occur in case-control studies with various subpopulations (e.g., different hospitals) where one does not match the number of persons with the explanatory attribute, as in an RCT. Instead, for each person with a certain attribute (e.g., a specific form of cancer), one selects a number of persons that do not have this attribute. Column-uniform design avoids AR as well, but among the presented association measures, it suffices to rule out AMP only for \(\pi_Y\).

Avoids AMP?              \(\pi_{D}\)  \(\pi_{R}\)  \(\pi_{O}\)  \(\pi_{Y}\)  \(\pi_{W}\)  \(\pi_{C}\)
Row-uniform design       yes          yes          no           yes          yes          yes
Column-uniform design    no           no           no           yes          no           no
Both                     yes          yes          yes          yes          yes          yes

Table 3: An overview of how row- and column-uniform design avoid the amalgamation paradox for various association measures.

Table 3 summarizes the properties of all association measures with respect to the AMP and the different forms of experimental design. The behavior of the log-odds measure \(\pi_O\), where neither row- nor column-uniform design alone suffices to rule out the AMP, will be discussed in Section 5.2.

We now identify one last fundamental condition for when data exhibit association reversal. Consider Figure 1, which displays the success proportions for treatment and control graphically.


Figure 1: A geometrical representation of a necessary condition for the occurrence of Association Reversal. The paradox can occur if the proportions are ordered as in the left graph; it cannot occur if they are ordered as in the right graph. [An extended description of figure 1 is in the supplement.]

In both examples, the treatment success rate is greater than the control success rate for both subpopulations. When will this order be preserved at the overall level? We know that the overall success rate for each condition (treatment/control) is constrained by the success rates in the subpopulations:

Fact 1: Suppose \(a_i, b_i > 0\) for all \(1 \le i \le N\). Then also

\[\begin{align*}\tag{2} \min \frac{a_i}{a_i+b_i} \le \frac{\sum_{j=1}^N a_j}{\sum_{j=1}^N (a_j+b_j)} \le \max \frac{a_i}{a_i+b_i} \end{align*}\]

This fact follows directly from the Law of Total Probability (proof omitted) and it gives us a simple necessary condition for the occurrence of Association Reversal (AR): turning to Figure 1 again, it implies that the overall success rate per condition has to be on the solid lines. Thus AR cannot occur in the right part of Figure 1, but it can occur if the proportions are ordered as in the left part of Figure 1. Generally, AR is avoided when the following condition holds:

\[\tag{RH}\begin{align*}\max_{1 \le i \le N} \frac{a_i}{a_i+b_i} & < \min_{1 \le i \le N} \frac{c_i}{c_i+d_i} \\[2ex]\:\text{ or }\:\\[2ex] \min_{1 \le i \le N} \frac{a_i}{a_i+b_i} & > \max_{1 \le i \le N} \frac{c_i}{c_i+d_i} \end{align*}\]

Any dataset that satisfies (RH) will be called row-homogeneous. By contrast, for any given set of proportions violating condition (RH), we can find datasets exhibiting these very same proportions such that AR indeed occurs (by fiddling with the size of the subpopulations; Lemma 3.1 in Mittal 1991). However, neither row homogeneity, nor the analogous condition of column homogeneity, nor their conjunction is sufficient for avoiding the amalgamation paradox AMP.
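Fact 1 and the role of condition (RH) can be checked numerically for Table 1. In this sketch (our own), the pooled success rate of each condition is bracketed by the subpopulation rates, and (RH) fails, which is what leaves room for the reversal:

```python
# Treatment and control rows of Table 1: (successes, failures) for men, women.
treat = [(8, 5), (12, 15)]
ctrl  = [(4, 3), (2, 3)]

def pooled(rows):
    """Pooled success rate over all subpopulations (Law of Total Probability)."""
    return sum(a for a, b in rows) / sum(a + b for a, b in rows)

def rates(rows):
    """Per-subpopulation success rates."""
    return [a / (a + b) for a, b in rows]

# Fact 1: the pooled rate lies between the min and max subpopulation rates.
for rows in (treat, ctrl):
    assert min(rates(rows)) <= pooled(rows) <= max(rates(rows))

# (RH) fails for Table 1: the treatment and control rate ranges overlap,
# so association reversal is not ruled out.
rh = max(rates(treat)) < min(rates(ctrl)) or min(rates(treat)) > max(rates(ctrl))
print(rh)   # False
```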

Finally, one might be interested in how frequently the paradox arises. Simulations by Pavlides and Perlman (2009) suggest that it should not occur frequently: the confidence interval for the probability of AR is a subset of the interval \([0;0.03]\) for both the uniform prior and the (objective) Jeffreys prior. Of course, the practical value of this diagnosis depends on whether the sampling assumptions are sensible, and whether the entire approach makes sense for real-life datasets where researchers can group the data into subpopulations along numerous dimensions.

3. Simpson’s Paradox and Causal Inference

Within the philosophical literature, Simpson’s Paradox received sustained attention due to its implications for accounts of causality that posit systematic connections between causal relationships and probability-raising. Specifically, the paradox reveals that facts about probability-raising will not necessarily be preserved when one partitions a population into subpopulations. This poses a number of important challenges to philosophical accounts of causal inference based on the concept of probability:

  1. What is the appropriate set of background factors for determining when a probabilistic relationship is causal?
  2. What do association reversals imply for causal inference?
  3. Does Simpson’s Paradox threaten the objectivity of causal relationships?

Strategies for treating the paradox and answering these questions have contributed substantially to the development of theories of probabilistic causality (Cartwright 1979; Eells 1991). A different set of answers is provided by more recent work on the paradox in the framework of graphical causal models (e.g., Pearl 1988, 2000 [2009]; Spirtes et al. 2000), and we will discuss both accounts in turn. In particular, we will explain how Simpson’s Paradox can be analyzed through the notions of confounding and the identifiability of a causal effect.

3.1 Probabilistic Causality and Simpson’s Paradox

Early accounts of probabilistic causation (e.g., Reichenbach 1956; Suppes 1970) sought to explicate causal claims purely in terms of probabilistic and temporal facts. On Suppes’ (1970) account, event \(\r{C}\) is a prima facie cause of \(\r{E}\) if and only if (i) \(\r{C}\) occurs before \(\r{E}\) and (ii) \(\r{C}\) raises the probability of \(\r{E}\).[2] As we have already seen in Section 2.1, not all prima facie causes are genuine causes. If I drink a strong blond Belgian beer now, I will probably be happy during the day, but also have a headache tomorrow. However, being happy would not thereby be the cause of the headache: the correlation is explained by the common cause—the beer drinking. The variable for drinking the beer screens off the probabilistic relationship between its effects, meaning that the effects will be uncorrelated when one conditions on it. The crux of Suppes’ account is that a prima facie causal relationship between \(\r{C}\) and \(\r{E}\) is a genuine causal relationship iff there is no factor \(\r{F}\) prior to \(\r{C}\) that screens off \(\r{C}\) from \(\r{E}\).[3]

Later theorists such as Cartwright (1979) and Eells (1991) developed this condition by making causal claims relative to a causally homogeneous background context, which is specified by a set of variables \(\b{K}\). Consider the following example of association reversal presented by Cartwright. Supposing that smoking \((\r{S})\) is a cause of heart disease \((\r{H})\), one might expect that smoking would raise the probability of heart disease. Yet this might not be the case. Suppose that in a population there is a strong correlation between smoking and exercising \((X)\), and that exercise lowers the probability of heart disease by more than smoking raises its probability. In such a case, smoking might lower the probability of heart disease although conditional on either \(X\) or \(\neg X\), \(\r{S}\) raises \(\r{H}\)’s probability.

Cartwright interprets this case as follows: causes always raise the probability of their effects, but this can be “concealed” by the correlation between the cause and some other variable (here, \(X\)). In order to isolate the genuine probabilistic relationship between \(\r{C}\) and \(\r{E}\), one needs to consider it in a context where such correlations cannot occur:

Probabilistic Causality (Cartwright): Let \(\b{K}\) denote all and only the causes of \(\r{E}\) other than \(\r{C}\) and effects of \(\r{C}\). Then \(\r{C}\) causes \(\r{E}\) if and only if, relative to all combinations of values of the variables in \(\b{K}\), \(\r{C}\) raises the probability of \(\r{E}\): \(p(\r{E}\mid \r{C},\b{K}) > p(\r{E}\mid \neg\r{C},\b{K})\).

While Suppes defends a reductive account of probabilistic causality, where the elements of \(\b{K}\) are determined without appeal to causal assumptions, Cartwright presents a non-reductive account where \(\b{K}\) must include all and only the causes of \(\r{E}\), excluding \(\r{C}\) itself and any variables that are causally intermediate between \(\r{C}\) and \(\r{E}\). The current consensus is that it is impossible to give a probabilistic account of causation without relying on causal concepts, and thus that no reductive account is feasible (though see Spohn 2012 for a dissenting view).

Although non-reductive accounts could not be used to explain causation to someone with no prior causal knowledge, they can nevertheless clarify how causal claims are tested, and illuminate the relationship between causation and probability (see also Woodward 2003: 20–22). Moreover, Cartwright argues that her general criterion for inclusion of background factors in \(\b{K}\) avoids the reference class problem for purely statistical accounts of causal explanation (i.e., by specifying the relevant populations for evaluating causal claims), thereby eliminating a threat to the objectivity of causal explanation. More detail is provided in the entry on probabilistic causality.

3.2 Specific Debates: Causal Interaction, Average Effects, Mediators

Cartwright’s innovations for probabilistic accounts of causality have triggered various debates related to Simpson’s Paradox. We highlight three of them here:

Debate 1: Causal Interaction

Cartwright claims that causes raise the probabilities of their effects across all background contexts,[4] but many purported causes only raise the probabilities of their effects in some contexts. In the latter cases, causes interact with background factors in producing their effects. To give Cartwright’s own example (1979: 428), ingesting an acid poison generally causes death, except in contexts where one also ingests an alkali poison (in which case the two cancel one another out). The problem such interactive causes pose for probabilistic accounts is that they threaten Cartwright’s picture on which the effect of probability-raising causes is “concealed” by a stronger negative cause which “dominates” them. These metaphors suggest that the probability-raising relationship between a cause and its effect reflects an intrinsic relationship between the variables that exists even when not manifested, an idea further developed in Cartwright (1989). Interaction means that causes do not operate in a vacuum, but rather only in the presence of background factors (for further discussion, see Otte 1985; Eells 1986; Hardcastle 1991).

Simpson’s Paradox should not be conflated with causal interaction, however. What is distinctive of the paradox is not that the probabilistic relationship reverses upon partitioning, but rather that it reverses in all of the resulting subpopulations.

Debate 2: Average Effects

Cartwright requires \(\b{K}\) to include all causes of \(\r{E}\), and thus to evaluate effects relative to homogeneous background contexts. The account thus does not allow for average effects. For example, suppose that a particular treatment \((\r{T})\) raises the probability of heart disease \((\r{H})\) in individuals who were born prematurely \((\r{P})\) but not in individuals who were not, and that \(\r{P}\) is not correlated with \(\r{T}\). In the whole population, the amount by which \(\r{T}\) lowers or raises the probability of \(\r{H}\) will be an average of the effects in the \(\r{P}\) and \(\neg\r{P}\) populations, weighted by their size. Dupré (1984) argues for abandoning the requirement that \(\b{K}\) include all causes of \(\r{E}\), and thus for allowing average effects.

A tempting lesson to draw from our opening example is that Simpson’s Paradox arises as a result of averaging over the populations of males and females, and that the only way to eliminate it is by ruling out average effects. However, causal heterogeneity does not by itself lead to the paradox.[5] Cases with heterogeneous background factors only produce association reversal if the factors are correlated with the causal variable—as demonstrated by Theorem 1 in Section 2.2.

Debate 3: Mediators

According to Cartwright, the set \(\b{K}\) should not include variables that are causally intermediate between \(C\) and \(E\). Such variables are called mediators. To see why, imagine a drug reduces the risk of heart disease by producing a chemical, represented by variable \(Z\), in the blood stream, and via no other factors. If \(C\) and \(E\) have no common causes, they will be probabilistically independent conditional on \(Z\). Intuitively, one should not hold the blood chemical fixed in evaluating the effect, since it is the means by which the effect is brought about.

When there are multiple paths between cause and effect, the question becomes more complex. Hesslow (1976) provides an example where taking birth control pills promotes a blood-clotting condition called thrombosis via a chemical in the blood, but inhibits it via preventing pregnancy, which itself is a cause of thrombosis. As a result, taking birth control intuitively influences thrombosis both positively and negatively. If one is interested in the net effect of \(C\) on \(E\)—as opposed to the effects via particular paths (Hitchcock 2001)—then one should not condition on mediators. However, conditioning is necessary for calculating path-specific effects (e.g., Pearl 2001; Weinberger 2019).

Distinguishing mediators from common causes is crucial for analyses of Simpson’s Paradox. For example, the causal models \(C\to Z \to E\) and \(C\leftarrow Z \to E\) exhibit the same conditional independencies: \(C\) and \(E\) will be associated unconditionally, but independent conditional on \(Z\). Only causal knowledge enables us to decide how we shall deal with the association reversal, and whether we need to condition upon \(Z\) when estimating the causal effect of \(C\) on \(E\) (we do in the second model, but not in the first). See also Section 3.4.

3.3 DAGs and Causal Identifiability

In recent years, the formal analysis of causation has been significantly enhanced by the development of graphical methods for representing causal hypotheses and for choosing among candidate hypotheses given one’s evidence, in particular those using directed acyclic graphs (DAGs: Pearl 1988, 2000 [2009]; Spirtes et al. 2000). A DAG contains a set of nodes connected by a set of directed edges or arrows such that there are no cycles (one cannot get from a node back to itself via a set of directed arrows). In the causal context, the nodes in a DAG are random variables and the arrows correspond to direct causal relationships. It is common to assume that the set of variables in a DAG is causally sufficient, meaning that it includes all common causes of variables in the set.

DAGs enable one to systematically map the relationship between causal hypotheses and joint probability distributions. They overlap with and build on techniques in the literature on probabilistic causality, but provide significantly stronger tools and results. See the entries on causal models, causation and manipulability, and counterfactual theories of causation for detailed introductions to causal inference with DAGs.

[Two diagrams: the first has three ovals labeled Gender, Treatment, and Success; the Gender oval has arrows pointing to the other two, and the Treatment oval has an arrow pointing to Success. The second duplicates the first, except that the arrow from Gender to Treatment is dotted and a fourth oval, Intervention, also points to Treatment.]

Figure 2: The relationship between the variables Treatment, Gender, and Success represented in a DAG, without and with an intervention on Treatment.

Figure 2 (left part) presents a plausible DAG for our running example, including the variables Treatment, Gender, and Success. There are two ways in which Treatment provides information about Success. One is that people who take the treatment may be more (or less) likely to recover as a result of having taken it. The other is that learning that someone took the treatment provides information about whether they are likely to be male or female, and this information is relevant to determining whether they will recover regardless of whether they took the treatment.

The graphs can, however, also be interpreted causally, and here the notion of an ideal intervention is crucial:

For an intervention on a variable \(V\) to be ideal is for it to determine \(V\)’s value such that it no longer depends on its other causes in the DAG. Graphically, we can represent an intervention by adding an additional cause I that “breaks” all of the arrows that would otherwise go into \(V\).

So, in Figure 2, Intervention is an ideal intervention on Treatment. Intervening on Treatment disrupts the evidential relationship with Gender—for example, by controlling for the proportion of male and female patients in each sample—so that any remaining probabilistic relationship between treatment and recovery can only be explained by having taken the treatment. Such an experimental design, where Treatment and Gender are made probabilistically independent, suffices to rule out association reversal (cf. Section 2.2).

Using the notion of an ideal intervention, one can explicate causation as follows (Pearl 2000 [2009]; Woodward 2003). \(C\) causes \(E\) if and only if it is possible to change the value or probability of \(E\) via some ideal intervention on \(C\). Such interventions distinguish between causal and merely probabilistic dependencies by eliminating any probabilistic relationship between \(C\) and \(E\) that can be traced to the influence of a common cause. This does not mean, however, that one can only get causal knowledge in cases where one can experimentally intervene. One of the key contributions of graphical causal models is that they enable one to systematically determine when one’s prior causal knowledge licenses one to interpret a particular probabilistic relationship causally.

The difference between the probability distributions resulting fromconditioning and from intervening is formally represented bysupplementing the probability calculus with thedo-operator (\(\do(X)\)) whereapplying the operator to a variable formally represents interveningupon it. Taking \(T\), \(S\), and \(M\) to denoteTreatment,Success, andGender, and giventhe graph inFigure 2, the observational probability distribution of \(S\) given \(T\) is not equal tothe probability distribution of \(S\) given an intervention on \(T\):

\[\tag{3}\label{int}p(\r{S}\mid \r{T}) \ne p(\r{S}\mid \do(\r{T}))\]

The difference between these two quantities is due to the impact of \(M\) on the distribution of \(T\). In contrast, the following two expressions are equivalent given the DAG:

\[\tag{4}\label{cond}p(\r{S}\mid \r{T},\r{M}) = p(\r{S}\mid \do(\r{T}),\r{M})\]

Here one can infer the effect of \(T\) on \(S\) from the observational distribution by conditioning on \(M\). In such a case, we say that (4) identifies the causal effect of \(T\) on \(S\). More generally, identifiability is a relationship between a DAG \(G\), a probability distribution \(P\), and a causal quantity \(\r{Q}\), such that \(\r{Q}\) is identifiable if and only if it is uniquely determined by \(P\) given \(G\). By contrast, when there are unmeasured common causes of \(S\) and \(T\), the probability distribution is compatible with any possible distribution for \(p(\r{S}\mid \do (\r{T}))\).

3.4 Confounding and Pearl’s Analysis of the Paradox

The concept of identifiability is crucial for understanding confounding, and for the analysis of Simpson’s Paradox through graphical causal models. The relationship between \(X\) and \(Y\) is confounded relative to variable set \(\b{Z}\) just in case \(P(Y\mid X,\b{Z}) \ne P(Y\mid \do(X),\b{Z})\) (i.e., the relationship is not identified). A confounding set of variables is one that biases the effect measurement. For instance, an unmeasured common cause is a confounder because it makes it impossible to differentiate the probabilistic dependence between the variables resulting from the common cause from that resulting from a causal relationship between them. Simpson’s Paradox emerges on this account due to confounding by the third variable. This notion of confounding can diverge from a common colloquial understanding of confounders as alternative explanations of an observed outcome other than the treatment.

A useful sufficient condition for identifiability is the back-door criterion (Pearl 1993, 2000 [2009: 79]). First we need to introduce some graphical terminology. A path between \(X\) and \(Y\) is a set of connected edges between \(X\) and \(Y\) going in any direction. \(Y\) is a descendant of \(X\) if there is a path from \(X\) to \(Y\) in which all the arrows go in the same direction. When \(X\) and \(Y\) are connected via a single path including a common cause, such as \(X \leftarrow Z \rightarrow Y\), \(X\) and \(Y\) will typically[6] be unconditionally probabilistically dependent, but will be independent conditional on \(Z\). For such a path, we say that \(Z\) blocks the path between \(X\) and \(Y\). In contrast, when \(X\) and \(Y\) are connected by a path including a common effect, such as \(X \rightarrow Z \leftarrow Y\), the path will be blocked provided that one does not condition on \(Z\) or a descendant of \(Z\). This reflects the fact that independent causes of a common effect will typically be dependent conditional on that effect. An effect of \(X\) on \(Y\) is identifiable if there are no unblocked “back-door paths” between \(X\) and \(Y\): all paths that pass through common causes are blocked, and all other paths excepting those by which the cause influences its effect are open.

Back-door Criterion (Pearl 1993) Given a variable pair \(\{X,Y\}\) in a DAG \(G\), the effect of \(X\) on \(Y\) is identifiable if there exists a variable set \(\b{Z}\) in \(G\) satisfying the following conditions:

  1. No node in \(\b{Z}\) is a descendant of \(X\), and
  2. \(\b{Z}\) blocks every path between \(X\) and \(Y\) containing an arrow into \(X\).

In this case, the effect of \(X\) on \(Y\) is identified by theformula

\[\tag{5} p(\r{Y}\mid \do(\r{X})) = \sum_{Z} p(\r{Y}\mid \r{X},\r{Z}) \, p(\r{Z})\]

Equation (5) reveals that it can be possible to derive a causal effect in a population by averaging over the effects in subpopulations partitioned by \(Z\). This is what we already saw in Section 2.2: if there is no dependence between being treated and being part of a subpopulation, associations cannot reverse at the general population level. Yet such a derivation is only licensed by causal assumptions about the relationships between the variables. The reader can verify that, given the DAG in Figure 2, the variables satisfy the back-door criterion (with \(\b{Z}=\{\textit{Gender}\}\)).[7]
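As an illustration, the adjustment formula (5) can be applied directly to the counts in Table 1. The sketch below (plain Python; the variable names are ours) assumes the DAG of Figure 2, in which Gender satisfies the back-door criterion:

```python
# (success, failure) counts from Table 1, by treatment arm and gender
men = {"T": (8, 5), "C": (4, 3)}
women = {"T": (12, 15), "C": (2, 3)}

def rate(successes, failures):
    return successes / (successes + failures)

n_men = sum(s + f for s, f in men.values())       # 20
n_women = sum(s + f for s, f in women.values())   # 32
p_m = n_men / (n_men + n_women)                   # p(M) = 20/52

def p_do(arm):
    # back-door adjustment: p(S | do(arm)) = sum_Z p(S | arm, Z) p(Z)
    return rate(*men[arm]) * p_m + rate(*women[arm]) * (1 - p_m)

p_obs_T = rate(8 + 12, 5 + 15)   # observational p(S | T) = 0.5
print(p_obs_T, p_do("T"), p_do("C"))
```

The observational distribution shows no association (\(p(\r{S}\mid\r{T}) = 0.5\) in both arms), yet the adjusted, interventional quantities favor the treatment (roughly 0.51 vs. 0.47), in line with the subgroup analysis.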

In our original example, the treatment increased the probability of recovery in each subpopulation, but not in the population as a whole. Should one approve the drug or not? The causal approach makes it easy to see why one should. The probabilistic relationship between Treatment and Success in the population is an evidential rather than a causal one. Learning that someone took the drug provides evidence about their gender, and this information is relevant to predicting whether they will recover. But this does not tell one whether the drug is causally efficacious. To learn this, one needs to know how the chances of recovery for individuals in the population would change given an intervention on treatment. This can be determined by conditioning on gender, which enables one both to learn the gender-specific effects of the drug and to derive the average effect in the whole population (using the back-door criterion).

[Two diagrams: the first has three ovals labeled Gender, Treatment, and Success; the Gender oval has arrows pointing to the other two, and the Treatment oval has an arrow pointing to Success. The second has three ovals labeled Birth Control, Pregnancy, and Thrombosis; the Birth Control oval has arrows pointing to the other two, and the Pregnancy oval has an arrow pointing to the Thrombosis oval.]

Figure 3: The DAG for the variables Treatment, Gender and Success (third variable = confounding factor), contrasted with the DAG for the variables Birth Control, Pregnancy, and Thrombosis (third variable = mediator).

Thus, whether one should partition the population based on a factor in order to identify a particular causal relationship does not depend only on the statistical distribution, but crucially on one’s causal background assumptions. Suppose that one was considering an intermediate variable such as Pregnancy in Hesslow’s (1976) example. Recall that in the example birth control influences thrombosis both positively via a blood chemical and negatively by reducing one’s chance of getting pregnant. This case is shown in Figure 3 and contrasted with our running example, where the third variable is a confounding factor. In order to identify the effect of birth control on thrombosis, it is crucial that one does not condition on pregnancy. If there are no unmeasured common causes of birth control and thrombosis, then a probability-raising relationship between birth control and thrombosis in the population as a whole would reliably indicate that taking birth control pills promotes thrombosis.

It is worth emphasizing that there is no basis for distinguishing the two causal structures in Figure 3 using statistics alone. Any data generated by the model on the left could also have been generated by a model with the causal structure of that on the right. Accordingly, the judgment that one should partition the population in one case but not the other cannot be based on the probabilities alone, but requires the additional information supplied by the causal model.

Consistent with Theorem 1, Pearl proves a causal version of Savage’s (1954) sure-thing principle (see also Section 5.3):

Causal sure-thing principle (Pearl 2016) An action \(\r{C}\) that increases the probability of an event \(\r{E}\) in each subpopulation must also increase the probability of \(\r{E}\) in the population as a whole, provided that the action does not change the distribution of the subpopulations.[8]

For example, if one assumes that Gender is not an effect of Treatment, it cannot be the case that the drug raises the probability of recovery in both males and females, but has no effect on recovery in the general population. This result provides an error theory for why people often find Simpson’s Paradox to be paradoxical in the first place. Specifically, Pearl (2000 [2009], 2014) claims that people conflate observational claims that \(\r{X}\) raises the probability of \(\r{Y}\) with causal claims that doing \(\r{X}\) (versus \(\neg\r{X}\)) would raise the probability of \(\r{Y}\). And assuming that the partitioning variable is not an effect of \(X\), it is impossible for doing \(\r{X}\) to raise the probability of \(\r{Y}\) in all subpopulations, but not in the population as a whole. So Pearl’s explanation of the paradox is that people conflate causal and non-causal expressions, and if the conditional probabilities in the examples are interpreted causally, Simpson’s reversals are impossible.

3.5 Implications

Whether Pearl provides the correct causal explanation of Simpson’s Paradox remains a topic of continued debate (Armistead 2014; see also Section 4). What should not, however, be controversial is that recent causal modeling techniques enable one to systematically distinguish between causal and probabilistic claims in a much more general and precise way than had previously been possible. While Cartwright required that all causes of \(E\) be included in the background context, for the sake of eliminating confounding it is only necessary to hold fixed common causes (and other variables needed to block back-door paths). Theorists of probabilistic causality were to some extent aware that one did not need to hold fixed all causes of the effect in order to eliminate confounding, but they lacked a general account of which variable sets are sufficient for identifying the effect. Simpson’s Paradox was especially threatening, since there was no way to provide general conditions under which an apparent positive causal relationship in a population would disappear entirely upon partitioning. Using Pearl’s framework, it is trivial to show that as long as one does not condition on mediators, if a probabilistic expression identifies an average positive effect of \(X\) on \(Y\) in a population, intervening on \(X\) must raise the probability of \(Y\) in at least some subpopulations (Weinberger 2015).

Turning back to the debate about average effects in the probabilistic framework, this fact vindicates Dupré’s (1984) liberal attitude toward average effects against critics such as Eells and Sober (1983: 54) who dismiss it as a “sorry excuse for a causal concept” (though see Hitchcock 2003: 13–15, and Hausman 2010: 56, for further nuances). Of course, a positive average effect is compatible with the cause lowering the probability of the effect significantly in many subpopulations. This reflects the fact that the partitioning variable(s) could interact with the cause of interest. But such possible interactions do not make the effect any less genuine as an average effect for the whole population.

This brings us to the issue of whether Simpson’s Paradox threatens the objectivity of causal relationships. Properly understood, it does not. It is certainly true that a cause can raise the probability of its effect in one population and lower it in another, or that it can have a positive effect in a whole population, but not in some of its subpopulations. But it is not as if only some of these causal relationships are genuine and philosophers must therefore find a privileged background context within which the true relationship is revealed. It is simply a fact about causation that different populations can have different sets of interactive background factors, and thus the average effects will genuinely differ across the populations.

4. What Makes Simpson’s Paradox Paradoxical?

Simpson’s Paradox is not a paradox in the sense of presenting an inconsistent set of plausible propositions of which at least one must be rejected. As shown in Section 2.2, mathematics does not rule out that associations are reversed at the level of subpopulations. Bandyopadhyay et al. (2011) helpfully distinguish between three questions one could ask about Simpson’s Paradox:

  1. Why, or in what sense, is Simpson’s Paradox a paradox?
  2. What is the proper analysis of the paradox?
  3. How should one proceed when confronted with a typical case of the paradox?

Question (i) is essentially a question about the psychology of reasoning: one must offer an account of why the (mathematically innocent) association reversals seem paradoxical to many. Such accounts help to identify valid forms of inference that lead individuals to mistakenly rule out association reversals, and thereby provide answers to question (ii). Such analyses can differentiate among subtly different forms of reasoning, and open the door to empirical work testing whether humans systematically fail to attend to particular differences.

Section 3.4 already presented one analysis of the paradox. On Pearl’s causal analysis, the appearance of a paradox results from a conflation between causal and probabilistic reasoning. If one interprets the claim that taking the drug raises the probability of recovery as the causal statement that intervening to give the drug will make patients more likely to recover, and plausibly assumes that taking the drug has no influence on gender, then the drug cannot lower the probability of recovery both among males and among females. But, of course, if one is considering ordinary conditional probabilities without any do-operators, such reversals can occur. Accordingly, the appearance of paradox results from conflating ordinary conditional probabilities with conditional probabilities representing the results of interventions.

Pearl’s answer to (ii) has immediate implications for (iii). In evaluating the relationship between two variables \(X\) and \(Y\) and determining whether one should partition based on some variable (or variable set) \(Z\), one should partition based on \(Z\) only if doing so will enable one to identify the causal relationship between \(X\) and \(Y\). This answer presupposes that the aim of partitioning the population is to identify causal relationships. Questions about how to proceed in light of the paradox only make sense given a context and given the kind of inference one wishes to draw.

Pearl (2014) presents several reasons supporting his analysis of the paradox. First, he argues that if the surprise resulting from the paradox were the result of a mere mathematical error, this could neither account for why the paradox “has captured the fascination of statisticians, mathematicians, and philosophers for over a century” (2014: 9) nor for the difficulty that reasoners have in avoiding the error even once they’ve been made aware of it. Only by means of a causal semantics can one demonstrate that Simpson’s reversals cannot occur when the conditional probabilities are interpreted causally. Second, he points to Simpson’s (1951) observation that judgments about whether the aggregated or non-aggregated population is relevant for evaluating the correlations depend on the story behind what the frequencies represent. Pearl accounts for this story-relativity by showing that whether one should partition a population is decided not by the probabilities but rather by the causal model generating the probabilities. These causal models cannot be distinguished by conditional probabilities alone.

Bandyopadhyay et al. (2011) reject Pearl’s causal analysis of the paradox, and defend an alternative mathematical explanation. They note that there can be instances of the paradox that do not seem to invoke any causal notions. For example, suppose we take the proportions in Table 1 not to refer to the proportions of recovering/non-recovering patients among the treatment/non-treatment groups in male and female populations, but rather to the proportions of red and blue marbles among big or small marbles in two bags. Suppose that in either bag the big marbles have a higher red-to-blue ratio than the small marbles. Bandyopadhyay et al. plausibly claim that in this case, it would be surprising to discover that, were we to pour the bags into a single box, the small marbles have a higher red-to-blue ratio than the big ones. If there are cases of the paradox that still elicit surprise despite having nothing to do with causality, then the general explanation of the paradox cannot be causal.[9]

Bandyopadhyay et al. rephrase the paradox as being about ratios andproportions: when it is the case that

\[\tag{6}\frac{a_1+b_1}{b_1} \gt \frac{c_1+d_1}{d_1} \quad \text{and} \quad \frac{a_2+b_2}{b_2} \gt \frac{c_2+d_2}{d_2} \]

—to be read as success proportions for treatment and control in the subpopulations, compare Table 2—many people expect that these inequalities are preserved in the overall population:

\[\tag{7}\frac{a_1+a_2+b_1+b_2}{b_1+b_2} > \frac{c_1+c_2+d_1+d_2}{d_1+d_2}\]

As we know from Section 2, this need not be the case. Bandyopadhyay et al. conducted a survey with university students on this matter: only 12% gave the correct answer that the inequalities (6), by themselves, do not constrain the truth value of inequality (7).
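With the counts from Table 1, exact rational arithmetic confirms that the subpopulation inequalities (6) hold while (7) fails (in this dataset it fails because the pooled ratios are exactly equal):

```python
from fractions import Fraction as F

def ratio(successes, failures):
    # (successes + failures) / failures, as in inequality (6)
    return F(successes + failures, failures)

# (success, failure) counts from Table 1
men_t, men_c = (8, 5), (4, 3)
women_t, women_c = (12, 15), (2, 3)

in_men = ratio(*men_t) > ratio(*men_c)        # 13/5 > 7/3
in_women = ratio(*women_t) > ratio(*women_c)  # 27/15 > 5/3
pooled = ratio(8 + 12, 5 + 15) > ratio(4 + 2, 3 + 3)  # 40/20 > 12/6 ?

print(in_men, in_women, pooled)  # True True False
```

Using `fractions.Fraction` keeps the comparison exact, so the failure of (7) is not a rounding artifact.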

Given the widespread literature revealing how seemingly error-prone humans can be when reasoning about probabilities (e.g., Kahneman, Slovic, & Tversky 1982), the proposal that Simpson’s Paradox can be explained by appeal to an error in probabilistic reasoning is plausible. Yet Bandyopadhyay et al. do not specify what this error is. Or, more specifically, they do not propose a valid form of reasoning that reasoners are mistakenly appealing to when falling prey to the paradox. The fact that people expect the ratios in subpopulations to be preserved in the combined population just shows that people are tricked by the paradox. It does not illuminate the underlying mistake that they are making when they are tricked. In this sense, Bandyopadhyay et al. do not answer their second question. They also, by their own admission, do not provide a general answer to (iii). They view this as a virtue of their account, since they believe that discussions of (iii) ought to be divorced from discussions of (i) and (ii).

Recently, Fitelson (2017) has proposed a confirmation-theoretic explanation of Simpson’s Paradox. His analysis relies on identifying confirmation with increasing the (subjective) probability of a proposition. Statements of the form “evidence \(\r{E}\) confirms hypothesis \(\r{H}\)” are, however, usually evaluated with respect to background knowledge \(\b{K}\), and this can lead to ambiguities. In particular, Fitelson distinguishes between the suppositional and conjunctive readings of a confirmation statement. In our running example, these statements would be as follows:

Suppositional (\(\bf \r{E}\) raises the probability of\(\bf\r{H}\) given \(\b{K}\)): If one is female, thenreceiving treatment increases one’s chance of recovery.

Conjunctive (\(\bf \r{E}\wedge\b{K}\) raises the probabilityof \(\bf \r{H}\)): Being a female treatment-receiverincreases one’s chance of recovery.

While the suppositional and conjunctive readings coincide for some accounts of confirmation (e.g., Carnap’s account of degree of confirmation as conditional probability), they can produce different outcomes for confirmation as probability-raising. For our data in Table 1, the suppositional reading is true: if one is in the female subpopulation, receiving treatment rather than being in the control group increases one’s chances of recovery. On the conjunctive reading, however, the statement is false: female treatment-receivers are less likely to recover (12/27) compared to the set of individuals who are either male or did not receive the treatment (14/25). More importantly, while the suppositional reading allows for association reversals, on the conjunctive reading it cannot be the case both that being a female treatment-receiver and being a male treatment-receiver raises the probability of recovery, but being a treatment-receiver simpliciter does not (Fitelson 2017: 300–302).

Fitelson’s confirmation-theoretic explanation of Simpson’s Paradox is that reasoners are not attentive to the difference between the suppositional and conjunctive readings of confirmation statements when considering the evidential relevance of learning an individual’s gender. On the conjunctive reading there cannot be association reversals, and because the suppositional and conjunctive readings do not differ for many accounts of confirmation, people mistakenly assume that there cannot be such reversals, even when they are relying on the suppositional reading.
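The divergence between the two readings can be checked directly against the counts in Table 1 (a small Python sketch using exact fractions; the variable names are ours):

```python
from fractions import Fraction as F

# (success, failure) counts from Table 1
men_t, men_c = (8, 5), (4, 3)
women_t, women_c = (12, 15), (2, 3)

def p(successes, failures):
    return F(successes, successes + failures)

# Suppositional: given that one is female, treatment raises recovery
suppositional = p(*women_t) > p(*women_c)        # 12/27 > 2/5

# Conjunctive: female-and-treated vs. everyone else (male or untreated)
rest_s = men_t[0] + men_c[0] + women_c[0]        # 8 + 4 + 2 = 14
rest_n = sum(men_t) + sum(men_c) + sum(women_c)  # 13 + 7 + 5 = 25
conjunctive = p(*women_t) > F(rest_s, rest_n)    # 12/27 > 14/25 ?

print(suppositional, conjunctive)  # True False
```

The suppositional claim holds while the conjunctive one fails, exactly the pattern Fitelson’s explanation turns on.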

Both Bandyopadhyay et al. and Fitelson claim that because the formulation of Simpson’s Paradox does not itself appeal to causal considerations, it is preferable to find a non-causal explanation for the paradox. Ultimately, it is an empirical question whether the paradox can be accounted for exclusively by errors in probabilistic reasoning, or, as Pearl suggests, by a conflation of causal and probabilistic reasoning. One conceptual barrier to disentangling these hypotheses is that there are systematic relationships between causal and probabilistic claims. For example, when the third variable \(\r{M}\) is uncorrelated with treatment \(T\) (i.e., \(p(\r{T}\mid \r{M}) = p(\r{T})\)), there can be no reversals (see also the theorems in Section 2.2). Does it follow that Simpson’s Paradox has a purely probabilistic explanation? Not necessarily. An alternative hypothesis is that the epistemic agent does not have knowledge of the relevant conditional probabilities, but does know that \(M\) is not a cause of \(T\) (\(p(\r{T}\mid \do(\r{M})) = p(\r{T})\)), preempting the occurrence of association reversals. The question of whether the source of the paradox is causal cannot be resolved purely by appeal to the mathematical conditions under which it arises. Rather, it depends on substantive psychological hypotheses about the role of causal and probabilistic assumptions in human reasoning.[10]

The empirical evidence on the paradox shows that reasoners find trivariate reasoning (i.e., with a causally relevant third variable) generally hard and fail to take its role properly into account, even if salient cues to its relevance are provided (Fiedler, Walther, Freytag, & Nickel 2003). Other studies point to the facilitative effect of causal models, statistical training, and high motivation (Schaller 1992; Waldmann & Hagmayer 1995), but the significant difficulties that reasoners encounter in Simpson-like tasks make it unlikely that the question of the right analysis of the paradox will soon be decided empirically.

5. Applications

5.1 Non-Categorical Data and Linear Regression

Grade Point        Distribution of Grades      Verbal SAT scores
Average (GPA)      1992         2002           1992         2002
A+                 5%           7%             619          607
A                  12%          17%            575          565
A−                 14%          17%            546          538
B                  52%          47%            486          479
C                  17%          11%            434          424
All grades         100%         100%           501          516

Table 4: Verbal SAT score data for American high schools, taken from Rinott & Tam (2003).

Simpson’s Paradox is not limited to categorical data: it can occur for cardinal data as well and show up in standard models for quantitative analysis. A famous example is the analysis of SAT scores—the results of college admission tests—in the United States as a function of the high school grade point average (GPA) of students. The data are given in Table 4: the overall SAT average rises from 1992 to 2002, but for each GPA group (A+/A/…), SAT averages are falling. This phenomenon is, however, very natural. As soon as there is a bit of grade inflation at high schools, each group loses its best students to the next higher group, lowering the SAT average per group. But this is of course consistent with the overall SAT average remaining equal, or even rising from 501 to 516, as in our dataset. A conclusion from the stratified data that “students are getting more stupid” would be mistaken. Since societal developments such as grade inflation affect both the grade distribution and the SAT scores, one should not condition on the GPA of a student when studying SAT scores over time (compare the back-door criterion from Section 3.4).[11]

[A graph with a y-axis of IQ score ranging from 70 to 130 and an x-axis of cups of coffee on test day ranging from 0 to 5. A dashed regression line rises from about 0 cups and 75 IQ to 5 cups and 130 IQ; 8 clusters of dots appear along the line.]

Figure 4: A linear regression model that illustrates Simpson’s Paradox for bivariate cardinal data. Each cluster of values corresponds to a single person (repeated measurement).

A similar example is presented in Figure 4, adapted from Kievit, Frankenhuis, Waldorp, and Borsboom (2013). The figure shows the effect of coffee intake on performance on an IQ test. Suppose that coffee actually decreases performance slightly because it makes drinkers more nervous and less focused. At the same time, coffee intake co-varies with education level (construction workers are too busy to drink coffee all the time!) and education level co-varies with test performance. When we measure performance repeatedly for different individuals, we see that their performance is slightly negatively affected by their coffee intake. However, the (unconditional) regression model of performance as a function of coffee intake misleadingly suggests that coffee consumption strongly improves performance! The reason for the confounding is the causal impact of the hidden covariate, education level, on both coffee consumption and performance. Similar to the results from Section 2, Simpson’s Paradox in linear models can be characterized formally by means of inequalities among regression coefficients (e.g., Pearl 2013), and its occurrence depends on the nature of the causal interaction between the involved variables.
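A simulation in the spirit of Figure 4 makes the pattern concrete. The numbers below are hypothetical (only the qualitative structure matters): a person index stands in for education level and drives both baseline performance and typical coffee intake, while the within-person effect of coffee is negative.

```python
import numpy as np

rng = np.random.default_rng(0)
slopes_within, xs, ys = [], [], []
for person in range(8):
    base_iq = 80 + 6 * person                        # education-linked baseline
    coffee = 0.6 * person + rng.normal(0, 0.3, size=20)
    # within each person, coffee slightly hurts performance
    iq = base_iq - 2.0 * coffee + rng.normal(0, 0.5, size=20)
    slopes_within.append(np.polyfit(coffee, iq, 1)[0])
    xs.append(coffee)
    ys.append(iq)

# pooled regression ignores the person-level clustering
pooled_slope = np.polyfit(np.concatenate(xs), np.concatenate(ys), 1)[0]
print([round(s, 2) for s in slopes_within], round(pooled_slope, 2))
```

Every within-person slope is negative (near the true value of −2), yet the pooled slope is strongly positive, because the hidden covariate shifts both coffee intake and performance upward across persons.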

5.2 Epidemiology and Meta-Analysis

Simpson’s Paradox in its various forms has attracted a lot ofattention in the epidemiological literature since it is relevant fordetermining and estimating the effect size of medical treatments, andthe effect of exposure to risk factors (e.g., smoking, alcohol) onmedical hazards.

One of the aims behind the methodology of randomized controlled trials (RCTs) is to eliminate the effect of potential confounders on whether a person is treated or not. This was described in Section 2.2 as row-uniform design (for experiments with categorical data). For example, if we ensure the same proportion of both genders in the treatment and control group, the same prevalence of different age groups, etc., we know that association reversal (AR) cannot occur with respect to those third variables, and the amalgamation paradox (AMP) is also ruled out for many measures.

However, the (log-)odds ratio, a popular measure of effect size in epidemiological research, behaves deviantly. Uniformly assigning individuals to the treatment and control condition reliably produces the AMP for the odds ratio whenever the third variable (= the subpopulation attribute) influences the success rate, given the treatment level (Theorem 2.4, Samuels 1993). The odds ratio is thus a particularly tricky association measure. Greenland (1987) gives the instructive example of an odds ratio that is equal in all subpopulations with row-uniform design, but halved when data are pooled.
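This non-collapsibility of the odds ratio is easy to reproduce with hypothetical numbers (a sketch in the spirit of Greenland’s example, not his actual data): two equally sized strata with identical stratum-specific odds ratios and a row-uniform (50:50) design, whose pooled odds ratio is nevertheless closer to 1.

```python
def odds_ratio(p1, p0):
    """Odds ratio for success probabilities p1 (treated) and p0 (control)."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# Two equally sized strata, 50:50 treatment assignment in each
# (row-uniform design); the success probabilities are hypothetical.
strata = [
    {"treat": 0.9, "ctrl": 0.6},   # stratum-specific OR = 6
    {"treat": 0.6, "ctrl": 0.2},   # stratum-specific OR = 6
]
for s in strata:
    assert abs(odds_ratio(s["treat"], s["ctrl"]) - 6.0) < 1e-9

# Pooling averages the success probabilities over the strata:
p_treat = sum(s["treat"] for s in strata) / 2   # 0.75
p_ctrl = sum(s["ctrl"] for s in strata) / 2     # 0.40
pooled_or = odds_ratio(p_treat, p_ctrl)
print(round(pooled_or, 3))  # 4.5 — attenuated, although nothing is confounded
```

Note that the risk difference does collapse here (0.3 and 0.4 in the strata average to the pooled 0.35), which is why the oddity is specific to the odds ratio.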

Meta-analytic problems, such as pooling various studies to determine the overall effect size of an intervention or risk factor, give a particularly interesting twist to Simpson’s Paradox. How should such studies be aggregated? Naïvely, one might suggest pooling the data from all studies and treating them as a single big study. This may work out if the study populations are very similar and the data are from RCTs, where the treatment/control ratio is typically 50:50. If this is indeed the case, then the overall dataset is row-uniform and AR (and for most measures, AMP) is avoided, as shown in Section 2.2. But for non-experimental data, there is no reason to assume that treatment/control proportions are equal across studies. Thus, the direction of the effect can be reversed when pooling (for examples, see Hanley & Thériault 2000; Reintjes, Boer, Pelt, & Mintjes-de Groot 2000; Rücker & Schumacher 2008).

Another reason for not pooling the data is that study populations are often heterogeneous and that calculating the strength of association (i.e., the effect size) on the basis of the pooled data may bias the estimate in the direction of the study with the largest sample size, while the characteristics of patients in that study need not be representative of the target group as a whole. In particular, while at the level of individual studies patients are usually assigned randomly to the treatment or control group, this cannot be said about the aggregate data (Cates 2002). Proper meta-analysis therefore proceeds by weighting the effects rather than pooling the data, either by a fixed effects model or (e.g., if the study populations are heterogeneous) by introducing a random effect of the study in the statistical model. The question of how to conduct a meta-analysis of epidemiological studies is also entangled with the choice of an association or effect size measure (Altman & Deeks 2002; Cates 2002; Greenland 1987), a question discussed in Section 2.2.

5.3 Decision Theory and the Sure-Thing Principle

Blyth (1972) argued that Simpson’s Paradox also constitutes a counterexample to the sure-thing principle of decision theory, or at least restricts its scope substantially. That principle is supposed to guide rational decisions under uncertainty, and has been stated by Savage as follows:

Sure-Thing Principle (STP) “If you would definitely prefer \(g\) to \(f\) either knowing that the event \(\r{B}\) obtained, or knowing that the event \(\r{B}\) did not obtain, then you definitely prefer \(g\) to \(f\)”. (Savage 1954: 21–22)

In his purported counterexample, Blyth treats \(\r{B}\) and \(\neg\r{B}\) as indicating the two subpopulations (e.g., two different hospitals). Suppose that treatment \(\r{T}\) is positively associated with recovery \(\r{R}\) for each subpopulation. In that case, assuming equal odds, we would rather bet on the recovery of a patient in the treatment group (action \(g\)) than on the recovery of a patient in the control group (action \(f\))—regardless of whether that person is in group \(\r{B}\) or group \(\neg\r{B}\). Thus, since we prefer \(g\) to \(f\) in either subpopulation, and since all patients are either in group \(\r{B}\) or in group \(\neg\r{B}\), we can infer, by the Sure-Thing Principle, that \(g\) is preferable to \(f\) also when we don’t know whether a patient is in group \(\r{B}\) or group \(\neg\r{B}\). But this inference is mistaken if association reversal occurs: it is perfectly compatible with the above scenario that the overall frequency of recovery is higher for non-treated than for treated patients! Blyth (1972: 366) concludes that

the Sure-Thing Principle […] seems not applicable to situations in which any action taken within \(f\) or \(g\) […] is allowed to be based sequentially on events dependent with [\(B\)].

See also Malinas (2001) for discussion.

To the extent that (conditional) degrees of belief just represent (conditional) dispositions to bet, Blyth’s reasoning is compelling. Association reversal means that

\[p(\r{R}\mid \r{T}) < p(\r{R}\mid \neg \r{T})\]

although

\[p(\r{R}\mid \r{T}, \r{B}) > p(\r{R}\mid \neg \r{T}, \r{B})\]

and

\[p(\r{R}\mid \r{T}, \neg\r{B}) > p(\r{R}\mid \neg \r{T}, \neg\r{B}),\]

and thus preference for a conditional bet on \(\r{T}\) (given the various levels of \(B\)) does not imply preference for the unconditional bet on \(\r{T}\) (see Section 2). However, Savage certainly did not intend the sure-thing principle to be a theorem of probability. To evaluate it as a principle that guides proper decision-making, we must consider cases where the predictor variable (here: treatment/control) stands for a proper act that affects the outcome via multiple paths.
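Blyth's pattern of inequalities is easy to realize with concrete counts. The following hypothetical table makes the treated group recover more often within each subpopulation while losing in the aggregate, because treatment is concentrated in the low-recovery subpopulation:

```python
# A concrete association reversal of the kind Blyth uses against the
# sure-thing principle (hypothetical counts). Within each subpopulation
# B and not-B the treated recover more often, yet overall the untreated
# recover more often.

counts = {
    # (subpopulation, treated): (recovered, total)
    ("B", True):     (18, 20),
    ("B", False):    (80, 100),
    ("notB", True):  (20, 100),
    ("notB", False): (2, 20),
}

def p(recovered, total):
    return recovered / total

# within-subpopulation comparisons: treatment wins in both
assert p(*counts[("B", True)]) > p(*counts[("B", False)])        # 0.90 > 0.80
assert p(*counts[("notB", True)]) > p(*counts[("notB", False)])  # 0.20 > 0.10

# aggregate comparison: treatment loses
rec_t = sum(r for (_, t), (r, n) in counts.items() if t)
n_t   = sum(n for (_, t), (r, n) in counts.items() if t)
rec_c = sum(r for (_, t), (r, n) in counts.items() if not t)
n_c   = sum(n for (_, t), (r, n) in counts.items() if not t)
print(rec_t / n_t, rec_c / n_c)   # ≈ 0.32 vs ≈ 0.68
```

A bettor who prefers the conditional bet on treatment given either value of \(B\) would thus lose by preferring the unconditional bet.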

Jeffrey (1982) recalls Savage’s (1954: 21) example of a businessman who believes that it is advantageous to buy a property regardless of whether the Democratic or the Republican candidate will win the upcoming mayoral election. Jeffrey’s twist is that the businessman’s utility depends not only on the property deal, but also on the election outcome. Specifically, buying the property raises the chances that the Democratic candidate, whom he dislikes, will win. In that case he would certainly buy the property after the election, regardless of the outcome, but he may refrain from buying it before the election.

In response to this challenge, Jeffrey (1982: 720) restricts the sure-thing principle to the case where

choice of one act or another is thought to have no tendency to facilitate or impede the coming about of any of the possible states of nature, and […] this is reflected in a probabilistic independence of states from acts.

That is, buying the property should not change our rational degree of belief in who wins the election. Pearl (2016) considers this response an “overkill” and notes that probabilistic associations are not a good means of expressing causal tendencies. Therefore he proposes a causal sure-thing principle that we have encountered in Section 3.4: If one is considering two acts \(f\) and \(g\), and the probability distribution of \(\r{B}\) does not change depending on whether one intervenes to choose \(f\) or \(g\), then if one prefers \(f\) to \(g\) whether or not \(B\) occurs, one prefers \(f\) unconditionally. The italicized condition ensures that the partitioning variable is not an effect of the intervention, and thus rules out Simpson’s reversals (see Section 3.4). Note that Pearl’s formulation, but not Jeffrey’s, allows us to apply the (causal) sure-thing principle to observational data, where states and acts may be statistically dependent without indicating genuine causation (e.g., because of self-selection effects).
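Why the fixed-distribution condition blocks the reversal can be seen from the law of total probability: if \(p(\r{B})\) is the same under both acts, the aggregate success rate of each act is the same fixed mixture of its conditional success rates, and a mixture of dominating terms dominates. A minimal numeric sketch (all probabilities hypothetical):

```python
# If the distribution of B is unaffected by the choice between acts f
# and g, dominance in each state carries over to the aggregate, because
# both aggregates are the same fixed mixture of conditional rates.

p_b = 0.3   # p(B): by assumption unaffected by choosing f or g

# conditional success probabilities (g dominates f in each state)
g = {"B": 0.9, "notB": 0.2}
f = {"B": 0.8, "notB": 0.1}

exp_g = p_b * g["B"] + (1 - p_b) * g["notB"]   # 0.41
exp_f = p_b * f["B"] + (1 - p_b) * f["notB"]   # 0.31
assert exp_g > exp_f   # dominance survives aggregation

# Simpson reversals need act-dependent weights: weight B with 0.9 under
# f (but 0.3 under g), and f's aggregate overtakes g's despite losing
# in both states.
exp_f_shifted = 0.9 * f["B"] + 0.1 * f["notB"]  # 0.73 > 0.41
assert exp_f_shifted > exp_g
```

The contrast between the fixed and the act-dependent weights is exactly the difference between Pearl's causal sure-thing principle and the probabilistic pattern Blyth exploits.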

Throughout this entry we have assumed knowledge of the causal facts pertinent to a situation. Scenarios in which an agent lacks such knowledge raise additional complications for decision theory. An agent typically cannot ensure that all confounders have been accounted for, and thus the possibility of repeated reversals raises questions about when one should adopt a promising policy that has not been experimentally tested (Peters, Janzing, & Schölkopf 2017: 174–175). A distinct concern is that an agent may not be sure whether her action counts as an intervention (e.g., in Newcomb scenarios), since it might not be clear whether she can manipulate a variable to render it independent of its prior causes (Stern 2019). Whether Simpson’s Paradox raises novel difficulties in such decision-making contexts has not yet been explored. See the entries on decision theory and causal decision theory for further discussion.

5.4 Philosophy of Biology and Natural Selection

Within the philosophy of biology, the units of selection debate (Sober 2000 [2018: ch. 4], 2014; Williams 1966) concerns whether natural selection operates only at the level of the individual or also on groups (where the individual is typically conceived either as the organism or the gene). This debate is especially important for understanding the evolution of altruism (Sober & Wilson 1999). Since altruistic individuals harm their own chances of survival and reproduction, they are less fit, and it is thus unclear how altruism could evolve as a result of natural selection. If, however, groups with more altruists are fitter than groups with fewer, and selection can act on groups, this could potentially explain how altruism could still evolve. Within the units of selection debate, Simpson reversals have played an important role in explaining the possibility of group-level selection.

Consider the following naive argument against the conceptual possibility of group-level selection.[12] Suppose that we define the fitness of a group as the average fitness of its individuals. In this context, altruistic individuals are, by definition, those with traits that reduce their individual fitness while improving the fitness of other group members. For instance, crows that issue warning cries when a predator approaches benefit the group while increasing the chances of being harmed themselves. Natural selection explains the evolution of traits on the basis that individuals with the trait are fitter than those without it (all else being equal). Since selfish individuals are by definition fitter than altruistic ones, it follows that groups with more altruistic individuals cannot be fitter. Or so one might argue.

By now it should be clear what is wrong with this type of argument—it does not follow from the fact that altruistic individuals are less fit than selfish ones in every population that populations containing both selfish and altruistic individuals cannot be fitter than populations with just selfish individuals. It could be that being an altruist is correlated with being in a population with more altruists, and that populations with more altruists are fitter. This dispenses with the naive argument. Note, however, that within every single group selfish individuals are fitter, so if the groups change membership only through reproduction (as opposed to migration and mutation) then over enough generations every group will end up consisting only of selfish individuals. So whether group selection can occur depends on additional facts about population structure and dynamics. Hamilton’s (1964) Kin Selection theory explains how altruism can evolve in cases where altruists are more likely to associate with other altruists (possibly because altruism runs in the family).
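The reversal at the heart of this reply can be exhibited in a toy model of one generation of selection (all parameters hypothetical): altruists are strictly less fit than selfish individuals within each group, yet the global altruist frequency rises, because altruist-rich groups grow faster.

```python
# Group selection as a Simpson reversal (toy model). Everyone in a
# group gains fitness b * (altruist fraction); altruists pay a cost c,
# so they are always less fit than selfish groupmates.

b, c = 5.0, 1.0
groups = [
    # (n_altruists, n_selfish)
    (90, 10),   # altruist-rich group
    (10, 90),   # altruist-poor group
]

def next_gen(a, s):
    x = a / (a + s)                   # altruist fraction in the group
    w_selfish  = 1 + b * x            # within-group fitnesses
    w_altruist = w_selfish - c        # altruists always less fit locally
    return a * w_altruist, s * w_selfish  # offspring counts

new = [next_gen(a, s) for a, s in groups]

# within every group the altruist fraction falls ...
for (a, s), (a2, s2) in zip(groups, new):
    assert a2 / (a2 + s2) < a / (a + s)

# ... but globally it rises, from 0.5 to ≈ 0.68
before = sum(a for a, _ in groups) / sum(a + s for a, s in groups)
after  = sum(a for a, _ in new) / sum(a + s for a, s in new)
assert after > before
print(before, round(after, 2))
```

As the text notes, without migration or assortment the within-group decline eventually wins out over repeated generations; the one-step reversal only shows that the naive argument is invalid, not that altruism is stable.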

The group selection hypothesis remains controversial among biologists. The present discussion reveals how the phenomenon of Simpson’s Paradox is relevant to theorizing about how group selection might be possible, and more broadly reveals how philosophical work on causation and probability can aid in clarifying scientific debates.

Recently, Simpson’s Paradox has been invoked in an ongoing debate regarding whether natural selection should be understood as causal or statistical. Walsh (2010), a prominent defender of the statistical view, points to cases of Simpson’s Paradox as showing that selection cannot be understood causally, and Otsuka, Turner, Allen, & Lloyd (2011) rebut this claim. An important point that emerges from this debate is that the term “population” is used differently in discussions of Simpson’s Paradox than it is in biology (cf. Weinberger 2018). Walsh presents an example in which a correlation in a population disappears when one splits the population into two parts. As Otsuka et al. point out, within population genetics, population size can be causally relevant to the fitness of its individuals. Note that Walsh’s example of dividing a population in half is not what we have been talking about in the context of Simpson’s Paradox. In the prior discussion, dividing the population was not a matter of changing its size, but rather of partitioning its probability distribution based on a variable.

5.5 Policy Questions: Interpreting Data on Discrimination

Bickel et al. (1975) present a classic example of Simpson’s Paradox involving a study of gender discrimination at Berkeley. The data revealed that men were more likely than women to be accepted to the university’s graduate programs, but the authors were unable to detect a bias towards men in any individual department. The authors use the paradox to explain why the higher university-wide acceptance rate for men does not show that any department discriminated against women. Specifically, women were more likely to apply to departments with lower acceptance rates. This leads to a probabilistic association between gender and the partitioning variable (department), which we have seen can lead to Simpson’s reversals.
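The structure of the Berkeley case can be reproduced with a two-department sketch. The numbers below are hypothetical, chosen only to mimic the pattern Bickel et al. describe: each department admits women at a (slightly) higher rate, yet the university-wide rate is far higher for men, because women mostly apply to the selective department.

```python
# A Berkeley-style aggregation reversal with hypothetical numbers.

admissions = {
    # department: {gender: (admitted, applied)}
    "lenient":   {"men": (500, 800), "women": (70, 100)},
    "selective": {"men": (20, 200),  "women": (110, 900)},
}

def rate(admitted, applied):
    return admitted / applied

# within each department, women are admitted at a higher rate
for dept in admissions.values():
    assert rate(*dept["women"]) > rate(*dept["men"])

def overall(gender):
    adm = sum(d[gender][0] for d in admissions.values())
    app = sum(d[gender][1] for d in admissions.values())
    return adm / app

print(overall("men"), overall("women"))   # 0.52 vs 0.18
assert overall("men") > overall("women")
```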

While the probabilistic structure of the Berkeley case is similar to other instances of the paradox, it raises an additional question. On a natural way to understand the case, the applicant’s gender is a cause of his or her applying to a more or less selective department. Exactly what it means for demographic variables such as gender or race to be a cause is a longer story for another day (Glymour & Glymour 2014; Sen & Wasow 2016). But assuming that gender is a cause here, then the department variable is a mediator, and one should not condition on mediators in evaluating the mediated causal relationship. So what is the justification for conditioning on department?

The answer is that in evaluating discrimination, what often matters are path-specific effects, rather than the net effect along all paths (Pearl 2000 [2009: 4.5.3]; Zhang & Bareinboim 2018). To give a different example (Pearl 2001), consider whether a hypothetical black job candidate was discriminated against based on her race. It is possible that as a result of prior racial discrimination, the candidate was denied opportunities to develop job-relevant qualifications, and as a result of lacking these qualifications was denied the job. This indirect effect of race on hiring would not be relevant for determining whether an employer discriminated against the candidate. Rather, what matters is whether the candidate would have been more likely to get the position had she been white, but had the same qualifications that she does as a result of being black. This is called the natural direct effect (Pearl 2001; Weinberger 2019). In determining whether the employer discriminated, what matters is not whether being black made any difference to the person’s being hired, but rather whether their being black had a direct influence not mediated by their job-relevant qualifications.

5.6 Using Statistics to Evaluate Task Performance

The common explanation for the Berkeley data, on which the paradox results from women applying to more selective departments, points to a larger class of cases in which it is important to account for differences in difficulty level across tasks. In baseball, for instance, it appears that over time batters have been striking out more frequently, despite improving in their ability to hit more difficult pitches while remaining as good at hitting less difficult ones (Watt 2016 [see Other Internet Resources]). This could be accounted for by the fact that pitchers have been throwing a higher proportion of difficult-to-hit pitches. This highlights the way that statistics about success rates in performing a task can be misleading in cases where the task difficulty changes over time.
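The batting example can be reduced to a mixture calculation (hypothetical rates and pitch shares, chosen only to illustrate the mechanism): hit rates improve on both pitch types, yet the overall rate drops because the task mix shifts toward harder pitches.

```python
# Success rates can fall even as skill improves on every task type,
# if the task mix shifts toward harder tasks. Hypothetical numbers.

eras = {
    #        (share of easy pitches, hit rate vs easy, hit rate vs hard)
    "then": (0.8, 0.30, 0.10),
    "now":  (0.4, 0.32, 0.15),
}

def overall(easy_share, hit_easy, hit_hard):
    return easy_share * hit_easy + (1 - easy_share) * hit_hard

assert eras["now"][1] > eras["then"][1]   # better vs easy pitches
assert eras["now"][2] > eras["then"][2]   # better vs hard pitches
assert overall(*eras["now"]) < overall(*eras["then"])  # yet worse overall
print(overall(*eras["then"]), overall(*eras["now"]))
```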

6. Conclusions

Simpson’s Paradox is not only a surprising mathematical fact; it serves as a lens through which to understand the role of probabilities in data analysis, causal inference, and decision-making. In this article, we have characterized its mathematical properties, given necessary and sufficient conditions for its occurrence, discussed its influence on theories of causality, evaluated competing theories of the nature of the paradox, and surveyed its applications in a range of empirical domains.

Although Simpson’s Paradox has been known for over a century and has a straightforward probabilistic analysis, we predict that it will remain a source of fruitful philosophical discussion. Pearl’s causal analysis of the paradox is relatively recent, and it is only now that graphical causal models are starting to play a central role in philosophical discussions of the paradox. Despite the continuity between graphical accounts and earlier probabilistic theories of causality, here we have highlighted ways in which the newer methods lead one to draw substantially different implications from the paradox. Pearl’s account renders certain debates from the earlier literature moot, while opening up new debates about the proper interpretation of the paradox. The responses to Pearl considered in Section 4 are only the first steps in a broader discussion about the relationships between causation, probability theory, and the psychology of reasoning. There remains room to clarify what it means to explain the paradox, and what counts as empirical support for a particular explanation. Such work would open the door to empirical testing, which has thus far been limited.

Finally, we would like to highlight connections between Simpson’s Paradox and other reasoning fallacies in the literature. First, the base rate fallacy is related to Simpson’s Paradox since the illusion that association reversal is impossible may be based on a neglect of the different base rates for treated and untreated people, given the third variable (Bar-Hillel 1990). Second, the fallacy of mistaking correlation for causation may contribute to the appearance of paradoxicality, since association reversal, when combined with this fallacy, implies two contradictory causal claims. Third, in both Simpson’s Paradox and the Monty Hall fallacy reasoners fail to see the probabilistic relevance of causal information. While in Simpson’s Paradox, reasoners ignore the relevance of a back-door path for an observed association, in the Monty Hall problem, reasoners fail to take into account how Monty’s action depends on his knowledge of what is behind the doors. Fourth, and last, the capacity of reasoners to detect the causes of association reversal also depends on the extent of the confirmation bias to which they are exposed (e.g., whether or not they find a discrimination mechanism plausible). We are unaware of systematic research into the connection between Simpson’s Paradox and these reasoning fallacies, but this could be a fruitful field for future research. There is perhaps nothing paradoxical about Simpson’s Paradox, but since we often struggle to understand it, our reasoning about association reversals may be entangled with various forms of reasoning that are susceptible to bias and error.

Bibliography

  • Altman, Douglas G. and Jonathan J. Deeks, 2002, “Meta-Analysis, Simpson’s Paradox, and the Number Needed to Treat”, BMC Medical Research Methodology, 2: art. 3. doi:10.1186/1471-2288-2-3
  • Armistead, Timothy W., 2014, “Resurrecting the Third Variable: A Critique of Pearl’s Causal Analysis of Simpson’s Paradox”, The American Statistician, 68(1): 1–7. doi:10.1080/00031305.2013.807750
  • Bandyopadhyay, Prasanta S., Davin Nelson, Mark Greenwood, Gordon Brittan, and Jesse Berwald, 2011, “The Logic of Simpson’s Paradox”, Synthese, 181(2): 185–208. doi:10.1007/s11229-010-9797-0
  • Bar-Hillel, Maya, 1990, “Back to Base Rates”, in Insights in Decision Making: A Tribute to Hillel J. Einhorn, Robin M. Hogarth (ed.), Chicago: University of Chicago Press, pp. 200–216.
  • Bickel, P. J., E. A. Hammel, and J. W. O’Connell, 1975, “Sex Bias in Graduate Admissions: Data from Berkeley”, Science, 187(4175): 398–404. doi:10.1126/science.187.4175.398
  • Blyth, Colin R., 1972, “On Simpson’s Paradox and the Sure-Thing Principle”, Journal of the American Statistical Association, 67(338): 364–366. doi:10.1080/01621459.1972.10482387
  • –––, 1973, “Simpson’s Paradox and Mutually Favorable Events”, Journal of the American Statistical Association, 68(343): 746. doi:10.1080/01621459.1973.10481419
  • Cartwright, Nancy, 1979, “Causal Laws and Effective Strategies”, Noûs, 13(4): 419–437. doi:10.2307/2215337
  • –––, 1989, Nature’s Capacities and Their Measurement, Oxford: Clarendon Press. doi:10.1093/0198235070.001.0001
  • Cates, Christopher J., 2002, “Simpson’s Paradox and Calculation of Number Needed to Treat from Meta-Analysis”, BMC Medical Research Methodology, 2: art. 1. doi:10.1186/1471-2288-2-1
  • Dupré, John, 1984, “Probabilistic Causality Emancipated”, Midwest Studies in Philosophy, 9: 169–175. doi:10.1111/j.1475-4975.1984.tb00058.x
  • Edwards, A. W. F., 1963, “The Measure of Association in a 2 × 2 Table”, Journal of the Royal Statistical Society. Series A (General), 126(1): 109. doi:10.2307/2982448
  • Eells, Ellery, 1986, “Probabilistic Causal Interaction”, Philosophy of Science, 53(1): 52–64. doi:10.1086/289291
  • –––, 1991, Probabilistic Causality, Cambridge: Cambridge University Press. doi:10.1017/CBO9780511570667
  • Eells, Ellery and Elliott Sober, 1983, “Probabilistic Causality and the Question of Transitivity”, Philosophy of Science, 50(1): 35–57. doi:10.1086/289089
  • Fiedler, Klaus, Eva Walther, Peter Freytag, and Stefanie Nickel, 2003, “Inductive Reasoning and Judgment Interference: Experiments on Simpson’s Paradox”, Personality and Social Psychology Bulletin, 29(1): 14–27. doi:10.1177/0146167202238368
  • Fitelson, Branden, 1999, “The Plurality of Bayesian Measures of Confirmation and the Problem of Measure Sensitivity”, Philosophy of Science, 66(Supplement): S362–S378. doi:10.1086/392738
  • –––, 2017, “Confirmation, Causation, and Simpson’s Paradox”, Episteme, 14(3): 297–309. doi:10.1017/epi.2017.25
  • Fitelson, Branden and Christopher Hitchcock, 2011, “Probabilistic Measures of Causal Strength”, in Phyllis McKay Illari, Federica Russo, & Jon Williamson (eds.), Causality in the Sciences, Oxford: Oxford University Press, pp. 600–627.
  • Glymour, Clark and Madelyn R. Glymour, 2014, “Commentary: Race and Sex Are Causes”, Epidemiology, 25(4): 488–490. doi:10.1097/EDE.0000000000000122
  • Good, I. J., 1960, “Weight of Evidence, Corroboration, Explanatory Power, Information and the Utility of Experiments”, Journal of the Royal Statistical Society: Series B (Methodological), 22(2): 319–331. doi:10.1111/j.2517-6161.1960.tb00378.x
  • Good, I. J. and Y. Mittal, 1987, “The Amalgamation and Geometry of Two-by-Two Contingency Tables”, The Annals of Statistics, 15(2): 694–711. doi:10.1214/aos/1176350369
  • Greenland, Sander, 1987, “Interpretation and Choice of Effect Measures in Epidemiologic Analyses”, American Journal of Epidemiology, 125(5): 761–768. doi:10.1093/oxfordjournals.aje.a114593
  • Hamilton, William D., 1964, “The Genetical Evolution of Social Behaviour. II”, Journal of Theoretical Biology, 7(1): 17–52. doi:10.1016/0022-5193(64)90039-6
  • Hanley, James A. and Gilles Thériault, 2000, “Simpson’s Paradox in Meta-Analysis”, Epidemiology, 11(5): 613. doi:10.1097/00001648-200009000-00022
  • Hardcastle, Valerie Gray, 1991, “Partitions, Probabilistic Causal Laws, and Simpson’s Paradox”, Synthese, 86(2): 209–228. doi:10.1007/BF00485809
  • Hausman, Daniel M., 2010, “Probabilistic Causality and Causal Generalizations”, in The Place of Probability in Science, Ellery Eells and J.H. Fetzer (eds.), (Boston Studies in the Philosophy of Science 284), Dordrecht: Springer Netherlands, 47–63. doi:10.1007/978-90-481-3615-5_2
  • Hesslow, Germund, 1976, “Two Notes on the Probabilistic Approach to Causality”, Philosophy of Science, 43(2): 290–292. doi:10.1086/288684
  • Hitchcock, Christopher, 2001, “A Tale of Two Effects”, The Philosophical Review, 110(3): 361–396. doi:10.2307/2693649
  • –––, 2003, “Of Humean Bondage”, The British Journal for the Philosophy of Science, 54(1): 1–25. doi:10.1093/bjps/54.1.1
  • Hoover, Kevin D., 2003, “Nonstationary Time Series, Cointegration, and the Principle of the Common Cause”, The British Journal for the Philosophy of Science, 54(4): 527–551. doi:10.1093/bjps/54.4.527
  • Imbens, Guido W. and Joshua D. Angrist, 1994, “Identification and Estimation of Local Average Treatment Effects”, Econometrica, 62(2): 467–475. doi:10.2307/2951620
  • Jeffrey, Richard, 1982, “The Sure Thing Principle”, PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association. Part 2: Symposia and Invited Papers, Chicago: University of Chicago Press, 719–730. doi:10.1086/psaprocbienmeetp.1982.2.192456
  • Kahneman, Daniel, Paul Slovic, and Amos Tversky (eds.), 1982, Judgment under Uncertainty: Heuristics and Biases, Cambridge: Cambridge University Press. doi:10.1017/CBO9780511809477
  • Kievit, Rogier A., Willem E. Frankenhuis, Lourens J. Waldorp, and Denny Borsboom, 2013, “Simpson’s Paradox in Psychological Science: A Practical Guide”, Frontiers in Psychology, 4: art. 513. doi:10.3389/fpsyg.2013.00513
  • Lindley, Dennis V. and Melvin R. Novick, 1981, “The Role of Exchangeability in Inference”, The Annals of Statistics, 9(1): 45–58. doi:10.1214/aos/1176345331
  • Malinas, Gary, 2001, “Simpson’s Paradox: A Logically Benign, Empirically Treacherous Hydra”, Monist, 84(2): 265–283. doi:10.5840/monist200184217
  • Mittal, Yashaswini, 1991, “Homogeneity of Subpopulations and Simpson’s Paradox”, Journal of the American Statistical Association, 86(413): 167–172. doi:10.1080/01621459.1991.10475016
  • Nagel, Ernest and Morris R. Cohen, 1934, An Introduction to Logic and Scientific Method, New York: Harcourt, Brace.
  • Otsuka, Jun, Trin Turner, Colin Allen, and Elisabeth A. Lloyd, 2011, “Why the Causal View of Fitness Survives”, Philosophy of Science, 78(2): 209–224. doi:10.1086/659219
  • Otte, Richard, 1985, “Probabilistic Causality and Simpson’s Paradox”, Philosophy of Science, 52(1): 110–125. doi:10.1086/289225
  • Pavlides, Marios G. and Michael D. Perlman, 2009, “How Likely Is Simpson’s Paradox?”, The American Statistician, 63(3): 226–233. doi:10.1198/tast.2009.09007
  • Pearl, Judea, 1988, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann.
  • –––, 1993, “[Bayesian Analysis in Expert Systems]: Comment: Graphical Models, Causality and Intervention”, Statistical Science, 8(3): 266–269. doi:10.1214/ss/1177010894
  • –––, 2000 [2009], Causality: Models, Reasoning, and Inference, Cambridge: Cambridge University Press. Second edition 2009. doi:10.1017/CBO9780511803161
  • –––, 2001, “Direct and Indirect Effects”, in Jack Breese & Daphne Koller (eds.), Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, San Francisco, CA: Morgan Kaufmann, pp. 411–420.
  • –––, 2013, “Linear Models: A Useful ‘Microscope’ for Causal Analysis”, Journal of Causal Inference, 1(1): 155–170. doi:10.1515/jci-2013-0003
  • –––, 2014, “Comment: Understanding Simpson’s Paradox”, The American Statistician, 68(1): 8–13. doi:10.1080/00031305.2014.876829
  • –––, 2016, “The Sure-Thing Principle”, Journal of Causal Inference, 4(1): 81–86. doi:10.1515/jci-2016-0005
  • Pearson, Karl, 1899, “On the Theory of Genetic (Reproductive) Selection”, Philosophical Transactions of the Royal Society, Series A, 192: 260–278.
  • Peirce, C. S., 1884, “The Numerical Measure of the Success of Predictions”, Science, new series 4(93): 453–454. doi:10.1126/science.ns-4.93.453-a
  • Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf, 2017, Elements of Causal Inference: Foundations and Learning Algorithms, Cambridge, MA: MIT Press.
  • Reichenbach, Hans, 1956, The Direction of Time, Berkeley, CA: University of California Press.
  • Reintjes, Ralf, Annette de Boer, Wilfrid van Pelt, and Joke Mintjes-de Groot, 2000, “Simpson’s Paradox: An Example from Hospital Epidemiology”, Epidemiology, 11(1): 81–83. doi:10.1097/00001648-200001000-00017
  • Rinott, Yosef and Michael Tam, 2003, “Monotone Regrouping, Regression, and Simpson’s Paradox”, The American Statistician, 57(2): 139–141. doi:10.1198/0003130031397
  • Rücker, Gerta and Martin Schumacher, 2008, “Simpson’s Paradox Visualized: The Example of the Rosiglitazone Meta-Analysis”, BMC Medical Research Methodology, 8: art. 34. doi:10.1186/1471-2288-8-34
  • Samuels, Myra L., 1993, “Simpson’s Paradox and Related Phenomena”, Journal of the American Statistical Association, 88(421): 81–88. doi:10.1080/01621459.1993.10594297
  • Savage, Leonard J., 1954, The Foundations of Statistics, New York: Wiley. Second revised edition 1972.
  • Schaller, Mark, 1992, “In-Group Favoritism and Statistical Reasoning in Social Inference: Implications for Formation and Maintenance of Group Stereotypes”, Journal of Personality and Social Psychology, 63(1): 61–74. doi:10.1037/0022-3514.63.1.61
  • Sen, Maya and Omar Wasow, 2016, “Race as a Bundle of Sticks: Designs That Estimate Effects of Seemingly Immutable Characteristics”, Annual Review of Political Science, 19: 499–522. doi:10.1146/annurev-polisci-032015-010015
  • Simpson, E. H., 1951, “The Interpretation of Interaction in Contingency Tables”, Journal of the Royal Statistical Society: Series B (Methodological), 13(2): 238–241. doi:10.1111/j.2517-6161.1951.tb00088.x
  • Skyrms, Brian, 1980, Causal Necessity: A Pragmatic Investigation of the Necessity of Laws, New Haven, CT: Yale University Press.
  • Sober, Elliott, 2000 [2018], Philosophy of Biology, New York: Westview Press. Second edition, New York: Routledge, 2018.
  • –––, 2014, The Nature of Selection: Evolutionary Theory in Philosophical Focus, Chicago: University of Chicago Press.
  • Sober, Elliott and David Sloan Wilson, 1999, Unto Others: The Evolution and Psychology of Unselfish Behavior, Cambridge, MA: Harvard University Press.
  • Spirtes, Peter, Clark Glymour, and Richard Scheines, 2000, Causation, Prediction, and Search, second edition, Cambridge, MA: MIT Press.
  • Spohn, Wolfgang, 2012, The Laws of Belief: Ranking Theory and Its Philosophical Applications, Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199697502.001.0001
  • Sprenger, Jan, 2018, “Foundations of a Probabilistic Theory of Causal Strength”, The Philosophical Review, 127(3): 371–398. doi:10.1215/00318108-6718797
  • Sprenger, Jan and Jacob Stegenga, 2017, “Three Arguments for Absolute Outcome Measures”, Philosophy of Science, 84(5): 840–852. doi:10.1086/693930
  • Stern, Reuben, 2019, “Decision and Intervention”, Erkenntnis, 84(4): 783–804. doi:10.1007/s10670-018-9980-0
  • Suppes, Patrick, 1970, A Probabilistic Theory of Causality, Amsterdam: North-Holland.
  • Waldmann, Michael and York Hagmayer, 1995, “Causal Paradox: When a Cause Simultaneously Produces and Prevents an Effect”, in J. D. Moore & J. F. Lehman (eds.), Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society, Mahwah, NJ: Erlbaum, pp. 425–430.
  • Walsh, Denis M., 2010, “Not a Sure Thing: Fitness, Probability, and Causation”, Philosophy of Science, 77(2): 147–171. doi:10.1086/651320
  • Weinberger, Naftali, 2015, “If Intelligence Is a Cause, It Is a Within-Subjects Cause”, Theory & Psychology, 25(3): 346–361. doi:10.1177/0959354315569832
  • –––, 2018, “Faithfulness, Coordination and Causal Coincidences”, Erkenntnis, 83(2): 113–133. doi:10.1007/s10670-017-9882-6
  • –––, 2019, “Path-Specific Effects”, The British Journal for the Philosophy of Science, 70(1): 53–76. doi:10.1093/bjps/axx040
  • Williams, George C., 1966, Adaptation and Natural Selection: A Critique of Some Current Evolutionary Thought (Princeton Science Library), Princeton, NJ: Princeton University Press.
  • Woodward, James, 2003, Making Things Happen: A Theory of Causal Explanation, Oxford: Oxford University Press. doi:10.1093/0195155270.001.0001
  • Yule, G. Udny, 1903, “Notes on the Theory of Association of Attributes in Statistics”, Biometrika, 2(2): 121–134. doi:10.1093/biomet/2.2.121
  • Zhang, Junzhe and Elias Bareinboim, 2018, “Fairness in Decision-Making—The Causal Explanation Formula”, in Thirty-Second AAAI Conference on Artificial Intelligence, 2037–2045. [Zhang and Bareinboim 2018 available online]

Other Internet Resources

Acknowledgments

This research was supported by the European Research Council through Starting Investigator Grant No. 640638 (J.S.), the Italian Ministry for University and Research through PRIN project “From Models to Decisions” (J.S.), and a research fellowship of the Alexander von Humboldt Foundation (N.W.). The authors would like to thank the editors for their invitation to contribute to the Stanford Encyclopedia of Philosophy, Reuben Stern for helpful feedback, and Judea Pearl for extensive comments on a previous draft. The authors have no conflicts of interest.

Copyright © 2021 by
Jan Sprenger<jan.sprenger@unito.it>
Naftali Weinberger<naftali.weinberger@gmail.com>



The Stanford Encyclopedia of Philosophy iscopyright © 2023 byThe Metaphysics Research Lab, Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054

