Statistics investigates and develops specific methods for evaluating hypotheses in the light of empirical facts. A method is called statistical, and thus the subject of study in statistics, if it relates facts and hypotheses of a particular kind: the empirical facts must be codified and structured into data sets, and the hypotheses must be formulated in terms of probability distributions over possible data sets. The philosophy of statistics concerns the foundations and the proper interpretation of statistical methods, their input, and their results. Since statistics is relied upon in almost all empirical scientific research, serving to support and communicate scientific findings, the philosophy of statistics is of key importance to the philosophy of science. It has an impact on the philosophical appraisal of scientific method, and on the debate over the epistemic and ontological status of scientific theory.
The philosophy of statistics harbors a large variety of topics and debates. Central to these is the problem of induction, which concerns the justification of inferences or procedures that extrapolate from data to predictions and general facts. Further debates concern the interpretation of the probabilities that are used in statistics, and the wider theoretical framework that may ground and justify the correctness of statistical methods. A general introduction to these themes is given in Section 1 and Section 2. Section 3 and Section 4 provide an account of how these themes play out in the two major theories of statistical method, classical and Bayesian statistics respectively. Section 5 directs attention to the notion of a statistical model, covering model selection and simplicity, but also discussing statistical techniques that do not rely on statistical models. Section 6 briefly mentions relations between the philosophy of statistics and several other themes from the philosophy of science, including confirmation theory, evidence, causality, measurement, and scientific methodology in general.
Statistics is a mathematical and conceptual discipline that focuses on the relation between data and hypotheses. The data are recordings of observations or events in a scientific study, e.g., a set of measurements of individuals from a population. The data actually obtained are variously called the sample, the sample data, or simply the data, and all possible samples from a study are collected in what is called a sample space. The hypotheses, in turn, are general statements about the target system of the scientific study, e.g., expressing some general fact about all individuals in the population. A statistical hypothesis is a general statement that can be expressed by a probability distribution over sample space, i.e., it determines a probability for each of the possible samples.
Statistical methods provide the mathematical and conceptual means to evaluate statistical hypotheses in the light of a sample. To this aim they employ probability theory, and incidentally generalizations thereof. The evaluations may determine how believable a hypothesis is, whether we may rely on the hypothesis in our decisions, how strong the support is that the sample gives to the hypothesis, and so on. Good introductions to statistics abound (e.g., Barnett 1999, Mood and Graybill 1974, Press 2002).
To set the stage, an example taken from Fisher (1935) will be helpful.
The tea tasting lady.
Consider a lady who claims that she can, by taste, determine the order in which milk and tea were poured into the cup. Now imagine that we prepare five cups of tea for her, tossing a fair coin to determine the order of milk and tea in each cup. We ask her to pronounce the order, and we find that she is correct in all cases! Now if she is guessing the order blindly then, owing to the random way we prepare the cups, she will answer correctly 50% of the time. This is our statistical hypothesis, referred to as the null hypothesis. It gives a probability of \(1/2\) to a correct guess and hence a probability of \(1/2\) to an incorrect one. The sample space consists of all strings of answers the lady might give, i.e., all series of correct and incorrect guesses, but our actual data sits in a rather special corner in this space. On the assumption of our statistical hypothesis, the probability of the recorded events is a mere 3%, or \(1/2^{5}\) more precisely. On this ground, we may decide to reject the hypothesis that the lady is guessing.
According to the so-called null hypothesis test, such a decision is warranted if the data actually obtained are included in a particular region within sample space, whose total probability does not exceed some specified limit, standardly set at 5%. Now consider what is achieved by the statistical test just outlined. We started with a hypothesis on the actual tea tasting abilities of the lady, namely, that she did not have any. On the assumption of this hypothesis, the sample data we obtained turned out to be surprising or, more precisely, highly improbable. We therefore decided that the hypothesis that the lady has no tea tasting abilities whatsoever can be rejected. The sample points us to a negative but general conclusion about what the lady can, or cannot, do.
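The arithmetic behind this test is simple enough to sketch in a few lines of Python (the code below is an illustration, not part of Fisher's original presentation):

```python
from math import comb

# Null hypothesis: the lady guesses blindly, so each of the five
# cups is called correctly with probability 1/2.
def prob_n_correct(n, t=5, p=0.5):
    """Binomial probability of exactly n correct calls out of t."""
    return comb(t, n) * p**n * (1 - p)**(t - n)

# Probability of the observed sample: all five cups correct.
p_all_correct = prob_n_correct(5)
print(p_all_correct)   # 0.03125, i.e., 1/2**5, about 3%

# Null hypothesis test at the conventional 5% level:
reject = p_all_correct < 0.05
print(reject)          # True: the observed data fall in the rejection region
```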
The basic pattern of a statistical analysis is thus familiar from inductive inference: we input the data obtained thus far, and the statistical procedure outputs a verdict or evaluation that transcends the data, i.e., a statement that is not entailed by the data alone. If the data are indeed considered to be the only input, and if the statistical procedure is understood as an inference, then statistics is concerned with ampliative inference: roughly speaking, we get out more than we have put in. And since the ampliative inferences of statistics pertain to future or general states of affairs, they are inductive. However, the association of statistics with ampliative and inductive inference is contested, both because statistics is considered to be non-inferential by some (see Section 3) and non-ampliative by others (see Section 4).
Despite such disagreements, it is insightful to view statistics as a response to the problem of induction (cf. Howson 2000 and the entry on the problem of induction). This problem, first discussed by Hume in his Treatise of Human Nature (Book I, part 3, section 6) but prefigured already by ancient sceptics like Sextus Empiricus (see the entry on ancient skepticism), is that there is no proper justification for inferences that run from given experience to expectations about the future. Transposed to the context of statistics, it reads that there is no proper justification for procedures that take data as input and that return a verdict, an evaluation, or some other piece of advice that pertains to the future, or to general states of affairs. Arguably, much of the philosophy of statistics is about coping with this challenge, by providing a foundation for the procedures that statistics offers, or else by reinterpreting what statistics delivers so as to evade the challenge.
It is debatable whether philosophers of statistics are ultimately concerned with the delicate, even ethereal issue of the justification of induction. In fact, many philosophers and scientists accept the fallibility of statistics, and find it more important that statistical methods are understood and applied correctly. As is so often the case, the fundamental philosophical problem serves as a catalyst: the problem of induction guides our investigations into the workings, the correctness, and the conditions of applicability of statistical methods. The philosophy of statistics, understood as the general header under which these investigations are carried out, is thus not concerned with ephemeral issues, but presents a vital and concrete contribution to the philosophy of science, and to science itself.
While there is large variation in how statistical procedures and inferences are organized, they all agree on the use of modern measure-theoretic probability theory (Kolmogorov), or a near kin, as the means to express hypotheses and relate them to data. By itself, a probability function is simply a particular kind of mathematical function, used to express the measure of a set (cf. Billingsley 1995).
Let \(W\) be a set with elements \(s\), and consider an initial collection of subsets of \(W\), e.g., the singleton sets \(\{ s \}\). Now consider the operation of taking the complement \(\bar{R}\) of a given set \(R\): the complement \(\bar{R}\) contains all and only those \(s\) that are not included in \(R\). Next consider the join \(R \cup Q\) of given sets \(R\) and \(Q\): an element \(s\) is a member of \(R \cup Q\) precisely when it is a member of \(R\), \(Q\), or both. The collection of sets generated by the operations of complement and join is called an algebra, denoted \({\cal S}\). In statistics we interpret \(W\) as the set of samples, and we can associate sets \(R\) with specific events or observations. A specific sample \(s\) includes a record of the event denoted with \(R\) exactly when \(s \in R\). We take the algebra of sets like \(R\) as a language for making claims about the samples.
A probability function is defined as an additive normalized measure over the algebra: a function \[ P: {\cal S} \rightarrow [0, 1] \] such that \(P(R \cup Q) = P(R) + P(Q)\) if \(R \cap Q = \emptyset\) and \(P(W) = 1\). The conditional probability \(P(Q \mid R)\) is defined as \[ P(Q \mid R) \; = \; \frac{P(Q \cap R)}{P(R)} , \] whenever \(P(R) > 0\). It determines the relative size of the set \(Q\) within the set \(R\). It is often read as the probability of the event \(Q\) given that the event \(R\) occurs. Recall that the set \(R\) consists of all samples \(s\) that include a record of the event associated with \(R\). By looking at \(P(Q \mid R)\) we zoom in on the probability function within this set \(R\), i.e., we consider the condition that the associated event occurs.
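These definitions can be made concrete with a toy example. The sketch below uses an assumed six-element sample space (a die roll) with the uniform measure, checks additivity, and computes a conditional probability:

```python
from fractions import Fraction

# Toy sample space W = {1,...,6} (a die roll) with the uniform measure.
W = frozenset(range(1, 7))

def P(event):
    """Additive normalized measure: relative size of the event in W."""
    return Fraction(len(event & W), len(W))

def P_cond(Q, R):
    """Conditional probability P(Q | R) = P(Q ∩ R) / P(R), for P(R) > 0."""
    return P(Q & R) / P(R)

R = frozenset({2, 4, 6})   # "even outcome"
Q = frozenset({4, 5, 6})   # "outcome above three"

# Additivity on disjoint sets, and normalization P(W) = 1:
assert P(R) + P(W - R) == P(W) == 1
# Conditional probability zooms in on R: two of the three even
# outcomes lie above three.
print(P_cond(Q, R))        # 2/3
```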
Now what does the probability function mean? The mathematical notion of probability does not provide an answer. The function \(P\) may be interpreted as physical, i.e., as describing facts in the world, such as frequencies or tendencies, or as epistemic, i.e., as describing our epistemic state, such as degrees of belief or strengths of evidential support.
This distinction should not be confused with that between objective and subjective probability. Both physical and epistemic probability can be given an objective and subjective character, in the sense that both can be taken as dependent or independent of a knowing subject and her conceptual apparatus. For more details on the interpretation of probability, the reader is invited to consult Galavotti (2005), Gillies (2000), Mellor (2005), von Plato (1994), the anthology by Eagle (2010), the handbook of Hajek and Hitchcock (forthcoming), or indeed the entry on interpretations of probability. In this context the key point is that the interpretations can all be connected to foundational programmes for statistical procedures. Although the match is not exact, the two major types specified above can be associated with the two major theories of statistics, classical and Bayesian statistics, respectively.
In the sciences, the idea that probabilities express physical states of affairs, often called chances or stochastic processes, is most prominent. They are relative frequencies in series of events or, alternatively, they are tendencies or propensities in the systems that realize those events. More precisely, the probability attached to the property of an event type can be understood as the frequency or tendency with which that property manifests in a series of events of that type. For instance, the probability of a coin landing heads is a half exactly when in a series of similar coin tosses, the coin lands heads half the time. Or alternatively, the probability is half if there is an even tendency towards both possible outcomes in the setup of the coin tossing. The mathematician Venn (1888) and scientists like Quetelet and Maxwell (cf. von Plato 1994) are early proponents of this way of viewing probability. Philosophical theories of propensities were first proposed by Peirce (1910), and developed by Popper (1959), Mellor (1971), Bigelow (1977), and Giere (1976); see Handfield (2012) for a recent overview. A rigorous theory of probability as frequency was first devised by von Mises (1981), also defended by Reichenbach (1938) and beautifully expounded in van Lambalgen (1987).
The notion of physical probability is connected to one of the major theories of statistical method, which has come to be called classical statistics. It was developed roughly in the first half of the 20th century, mostly by mathematicians and working scientists like Fisher (1925, 1935, 1956), Wald (1939, 1950), Neyman and Pearson (1928, 1933, 1967), and refined by very many classical statisticians of the last few decades. The key characteristic of this theory of statistics aligns naturally with viewing probabilities as physical chances, hence pertaining to observable and repeatable events. Physical probability cannot meaningfully be attributed to statistical hypotheses, since hypotheses do not have tendencies or frequencies with which they come about: they are categorically true or false, once and for all. Attributing probability to a hypothesis seems to entail that the probability is read epistemically.
Classical statistics is often called frequentist, owing to the centrality of frequencies of events in classical procedures and the prominence of the frequentist interpretation of probability developed by von Mises. In this interpretation, chances are frequencies, or proportions in a class of similar events or items. They are best thought of as analogous to other physical quantities, like mass and energy. It deserves emphasis that frequencies are thus conceptually prior to chances. In propensity theory the probability of an individual event or item is viewed as a tendency in nature, so that the frequencies, or the proportions in a class of similar events or items, manifest as a consequence of the law of large numbers. In the frequentist theory, by contrast, the proportions lay down, indeed define, what the chances are. This leads to a central problem for frequentist probability, the so-called reference class problem: it is not clear what class to associate with an individual event or item (cf. Reichenbach 1949, Hajek 2007). One may argue that the class needs to be as narrow as it can be, but in the extreme case of a singleton class of events, the chances of course trivialize to zero or one. Since classical statistics employs non-trivial probabilities that attach to the single case in its procedures, a fully frequentist understanding of statistics is arguably in need of a response to the reference class problem.
To illustrate, we briefly consider physical probability in the example of the tea tasting lady.
Physical probability
We denote the null hypothesis that the lady is merely guessing by \(h\). Say that we follow the rule indicated in the example above: we reject this null hypothesis, i.e., deny that the lady is merely guessing, whenever the sampled data \(s\) is included in a particular set \(R\) of possible samples, so \(s \in R\), where \(R\) has a summed probability of 5% according to the null hypothesis. Now imagine that we are supposed to judge a whole population of tea tasting ladies, scattered in tea rooms throughout the country. Then, by running the experiment and adopting the rule just cited, we know that we will falsely attribute special tea tasting talents to 5% of those ladies for whom the null hypothesis is true, i.e., who are in fact merely guessing. In other words, this percentage pertains to the physical probability of a particular set of events, which by the rule is connected to a particular error in our judgment.
Now say that we have found a lady for whom we reject the null hypothesis, i.e., a lady who passes the test. Does she have the tea tasting ability or not? Unfortunately this is not the sort of question that can be answered by the test at hand. A good answer would presumably involve the proportion of ladies who indeed have the special tea tasting ability among those whose scores exceeded a certain threshold, i.e., those who answered correctly on all five cups. But this latter proportion, namely of ladies for whom the null hypothesis is false among all those ladies who passed the test, differs from the proportion of ladies who passed the test among those ladies for whom it is false. It will depend also on the proportion of ladies who have the ability in the population under scrutiny. The test, by contrast, only involves proportions within a group of ladies for whom the null hypothesis is true: we can only consider probabilities for particular events on the assumption that the events are distributed in a given way.
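The dependence on the population can be sketched numerically. The 3/4 accuracy for able ladies and the base rates below are assumptions chosen purely for illustration:

```python
from fractions import Fraction

p_pass_null = Fraction(1, 2) ** 5   # a guessing lady passes with prob 1/32
p_pass_able = Fraction(3, 4) ** 5   # an able lady (assumed 3/4 accuracy) passes

def prop_able_among_passers(base_rate):
    """Proportion of ladies with the ability among those who pass,
    as a function of a hypothetical base rate in the population."""
    passers_able = base_rate * p_pass_able
    passers_null = (1 - base_rate) * p_pass_null
    return passers_able / (passers_able + passers_null)

# The test fixes the 1/32 error rate among guessers, but the answer to
# "does a passing lady have the ability?" varies with the base rate:
for q in (Fraction(1, 100), Fraction(1, 2)):
    print(q, float(prop_able_among_passers(q)))
```

With a 1% base rate most passers are in fact lucky guessers; with a 50% base rate most passers are genuinely able, even though the test itself is unchanged.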
There is an alternative way of viewing the probabilities that appear in statistical methods: they can be seen as expressions of epistemic attitudes. We are again facing several interrelated options. Very roughly speaking, epistemic probabilities can be doxastic, decision-theoretic, or logical.
Probabilities may be taken to represent doxastic attitudes in the sense that they specify opinions about data and hypotheses of an idealized rational agent. The probability then expresses the strength or degree of belief, for instance regarding the correctness of the next guess of the tea tasting lady. They may also be taken as decision-theoretic, i.e., as part of a more elaborate representation of the agent, which determines her dispositions towards decisions and actions about the data and the hypotheses. Oftentimes a decision-theoretic representation involves doxastic attitudes alongside preferential and perhaps other ones. In that case, the probability may for instance express a willingness to bet on the lady being correct. Finally, the probabilities may be taken as logical. More precisely, a probabilistic model may be taken as a logic, i.e., a formal representation that fixes a normative ideal for uncertain reasoning. According to this latter option, probability values over data and hypotheses have a role that is comparable to the role of truth values in deductive logic: they serve to secure a notion of valid inference, without carrying the suggestion that the numerical values refer to anything psychologically salient.
The epistemic view on probability came into development in the 19th and the first half of the 20th century, first by the hand of De Morgan (1847) and Boole (1854), later by Keynes (1921), Ramsey (1926) and de Finetti (1937), and by decision theorists, philosophers and inductive logicians such as Carnap (1950), Savage (1962), Levi (1980), and Jeffrey (1992). Important proponents of these views in statistics were Jeffreys (1961), Edwards (1972), Lindley (1965), Good (1983), Jaynes (2003), as well as very many Bayesian philosophers and statisticians of the last few decades (e.g., Goldstein 2006, Kadane 2011, Berger 2006, Dawid 2004). All of these have a view that places probabilities somewhere in the realm of the epistemic rather than the physical, i.e., not as part of a model of the world but rather as a means to model a representing system like the human mind.
The above division is certainly not complete and it is blurry at the edges. For one, the doxastic notion of probability has mostly been spelled out in a behaviorist manner, with the help of decision theory. Many have adopted so-called Dutch book arguments to make the degree of belief precise, and to show that it is indeed captured by the mathematical theory of probability (cf. Jeffrey 1992). According to such arguments, the degree of belief in the occurrence of an event is given by the price of a betting contract that pays out one monetary unit if the event manifests. However, there are alternatives to this behaviorist take on probability as doxastic attitude, using accuracy or proximity to the truth. Most of these are versions or extensions of the arguments proposed by de Finetti (1974). Others have developed an axiomatic approach based on natural desiderata for degrees of belief (e.g., Cox 1961).
Furthermore, and as alluded to above, within the doxastic conception of probability we can make a further subdivision into subjective and objective doxastic attitudes. The defining characteristic of an objective doxastic probability is that it is constrained by the demand that the beliefs are calibrated to some objective fact or state of affairs, or else by further rationality criteria. A subjective doxastic attitude, by contrast, is not constrained in such a way: from a normative perspective, agents are free to believe as they see fit, as long as they comply with the probability axioms.
For present concerns the important point is that each of these epistemic interpretations of the probability calculus comes with its own set of foundational programs for statistics. On the whole, epistemic probability is most naturally associated with Bayesian statistics, the second major theory of statistical methods (Press 2002, Berger 2006, Gelman et al. 2013). The key characteristic of Bayesian statistics flows directly from the epistemic interpretation: under this interpretation it becomes possible to assign probability to a statistical hypothesis and to relate this probability, understood as an expression of how strongly we believe the hypothesis, to the probabilities of events. Bayesian statistics allows us to express how our epistemic attitudes towards a statistical hypothesis, be they logical, decision-theoretic, or doxastic, change under the impact of data.
To illustrate the epistemic conception of probability in Bayesian statistics, we briefly return to the example of the tea tasting lady.
Epistemic probability
As before we denote the null hypothesis that the lady is guessing randomly with \(h\), so that the distribution \(P_{h}\) gives a probability of 1/2 to any guess made by the lady. The alternative \(h'\) is that the lady performs better than a fair coin. More precisely, we might stipulate that the distribution \(P_{h'}\) gives a probability of 3/4 to a correct guess. At the outset we might find it rather improbable that the tea tasting lady has special tea tasting abilities. To express this we give the hypothesis of her having these abilities only half the probability of her not having the abilities: \(P(h') = 1/3\) and \(P(h) = 2/3\). Now, leaving the mathematical details to Section 4.1, after receiving the data that she guessed all five cups correctly, our new belief in the lady's special abilities has more than reversed. We now think it roughly four times more probable that the lady has the special abilities than that she is merely a random guesser: \(P(h') = 243/307 \approx 4/5\) and \(P(h) = 64/307 \approx 1/5\).
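Leaving aside the general machinery of Section 4.1, the particular numbers in this example can be reproduced with a short computation:

```python
from fractions import Fraction

# Priors from the example: P(h) = 2/3 (guessing), P(h') = 1/3 (able).
prior = {"h": Fraction(2, 3), "h_prime": Fraction(1, 3)}
# Likelihoods of five correct guesses out of five under each hypothesis:
likelihood = {"h": Fraction(1, 2) ** 5, "h_prime": Fraction(3, 4) ** 5}

# Bayes' theorem: posterior is proportional to prior times likelihood,
# normalized so the posteriors sum to one.
joint = {k: prior[k] * likelihood[k] for k in prior}
total = sum(joint.values())
posterior = {k: v / total for k, v in joint.items()}

print(posterior["h_prime"])   # 243/307, roughly 4/5
print(posterior["h"])         # 64/307, roughly 1/5
```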
The take-home message is that the Bayesian method allows us to express our epistemic attitudes to statistical hypotheses in terms of a probability assignment, and that the data impact on this epistemic attitude in a regulated fashion.
It should be emphasized that Bayesian statistics is not the sole user of an epistemic notion of probability. Indeed, a frequentist understanding of probabilities assigned to statistical hypotheses seems nonsensical. But it is perfectly possible to read the probabilities of events, or elements in sample space, as epistemic, quite independently of the statistical method that is being used. As further explained in the next section, several philosophical developments of classical statistics employ epistemic probability, most notably fiducial probability (Fisher 1955 and 1956; see also Seidenfeld 1992 and Zabell 1992), likelihoodism (Hacking 1965, Edwards 1972, Royall 1997), and evidential probability (Kyburg 1961), or connect the procedures of classical statistics to inference and support in some other way. In all these developments, probabilities and functions over sample space are read epistemically, i.e., as expressions of the strength of evidence, the degree of support, or similar.
The collection of procedures that may be grouped under classical statistics is vast and multi-faceted. By and large, classical statistical procedures share the feature that they only rely on probability assignments over sample spaces. As indicated, an important motivation for this is that those probabilities can be interpreted as frequencies, from which the term of frequentist statistics originates. Classical statistical procedures are typically defined by some function over sample space, where this function depends, often exclusively, on the distributions that the hypotheses under consideration assign to the sample space. For the range of samples that may be obtained, the function then points to one of the hypotheses, or perhaps to a set of them, as being in some sense the best fit with that sample. Or, conversely, it discards candidate hypotheses that render the sample too improbable.
In sum, classical procedures employ the data to narrow down a set of hypotheses. Put in such general terms, it becomes apparent that classical procedures provide a response to the problem of induction. The data are used to get from a weak general statement about the target system to a stronger one, namely from a set of candidate hypotheses to a subset of them. The central concern in the philosophy of statistics is how we are to understand these procedures, and how we might justify them. Notice that the pattern of classical statistics resembles that of eliminative induction: in view of the data we discard some of the candidate hypotheses. Indeed classical statistics is often seen in loose association with Popper's falsificationism, but this association is somewhat misleading. In classical procedures statistical hypotheses are discarded when they render the observed sample too improbable, which of course differs from discarding hypotheses that deem the observed sample impossible.
The foregoing already provided a short example and a rough sketch of classical statistical procedures. These are now specified in more detail, on the basis of Barnett (1999) as primary source. The following focuses on two very central procedures, hypothesis testing and estimation. The first has to do with the comparison of two statistical hypotheses, and invokes theory developed by Neyman and Pearson. The second concerns the choice of a hypothesis from a set, and employs procedures devised by Fisher. While these figures are rightly associated with classical statistics, their philosophical views diverge. We return to this below.
The procedure of Fisher's null hypothesis test was already discussed briefly in the foregoing. Let \(h\) be the hypothesis of interest and, for the sake of simplicity, let \(S\) be a finite sample space. The hypothesis \(h\) imposes a distribution over the sample space, denoted \(P_{h}\). Every point \(s\) in the space represents a possible sample of data. We now define a function \(F\) on the sample space that identifies when we will reject the null hypothesis by marking the samples \(s\) that lead to rejection with \(F(s) = 1\), as follows: \[ F(s) = \begin{cases} 1 \quad \text{if } P_{h}(s) < r,\\ 0 \quad \text{otherwise.} \end{cases} \] Notice that the definition of the region of rejection, \(R_{r} = \{ s :\: F(s) = 1 \}\), hinges on the probability of the data under the assumption of the hypothesis, \(P_{h}(s)\). This expression is often called the likelihood of the hypothesis on the sample \(s\). We can set the threshold \(r\) for the likelihood to a suitable value, such that the total probability of the region of rejection \(R_{r}\) is below a given level of error, for example, \(P_{h}(R_{r}) < 0.05\).
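As a sketch of this rejection rule, consider an assumed ten-cup variant of the experiment, with the number of correct guesses as the sample point. (In the five-cup version every full sequence of guesses is equally probable under the null hypothesis, so a likelihood threshold over sequences is all-or-nothing; over the count statistic the likelihoods differ.)

```python
from math import comb

# Assumed variant: t = 10 cups, sample point n = number of correct
# guesses. Under the null hypothesis h, n is binomial with p = 1/2.
t = 10

def likelihood_h(n):
    """P_h(n): probability of exactly n correct guesses under h."""
    return comb(t, n) / 2**t

def F(n, r):
    """Test function: 1 marks samples in the rejection region R_r."""
    return 1 if likelihood_h(n) < r else 0

# Choose the threshold r so that the total probability of the
# rejection region stays below the 5% error level.
r = comb(t, 2) / 2**t   # reject samples strictly less probable than n = 2
region = [n for n in range(t + 1) if F(n, r)]
print(region)                                  # [0, 1, 9, 10]
print(sum(likelihood_h(n) for n in region))    # about 0.021, below 0.05
```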
It soon appeared that comparisons between two rival hypotheses were far more informative, in particular because little can be said about error rates if the null hypothesis is in fact false. Neyman and Pearson (1928, 1933, and 1967) devised the so-called likelihood ratio test, a test that compares the likelihoods of two rival hypotheses. Let \(h\) and \(h'\) be the null and the alternative hypothesis respectively. We can compare these hypotheses by the following test function \(F\) over the sample space: \[ F(s) = \begin{cases} 1 \quad \text{if } \frac{P_{h'}(s)}{P_{h}(s)} > r,\\ 0 \quad \text{otherwise,} \end{cases} \] where \(P_{h}\) and \(P_{h'}\) are the probability distributions over the sample space determined by the statistical hypotheses \(h\) and \(h'\) respectively. If \(F(s) = 1\) we decide to reject the null hypothesis \(h\); else we accept \(h\) for the time being and so disregard \(h'\).
The decision to accept or reject a hypothesis is associated with the so-called significance and power of the test. The significance is the probability, according to the null hypothesis \(h\), of obtaining data that lead us to falsely reject this hypothesis \(h\): \[ \text{Significance}_{F} = \alpha = P_{h}(R_{r}) = \sum_{s \in S} F(s) P_{h}(s) . \] The probability \(\alpha\) is alternatively called the type-I error, and it is often referred to as the significance level or the p-value. The power is the probability, according to the alternative hypothesis \(h'\), of obtaining data that lead us to correctly reject the null hypothesis \(h\): \[ \text{Power}_{F} = 1 - \beta = P_{h'}(R_{r}) = \sum_{s \in S} F(s) P_{h'}(s) . \] The probability \(\beta\) is called the type-II error of falsely accepting the null hypothesis. An optimal test is one that minimizes both the errors \(\alpha\) and \(\beta\). In their fundamental lemma, Neyman and Pearson proved that the decision has optimal significance and power for, and only for, likelihood-ratio test functions \(F\). That is, an optimal test depends only on a threshold for the ratio \(P_{h'}(s) / P_{h}(s)\).
The example of the tea tasting lady allows for an easy illustration of the likelihood ratio test.
Neyman-Pearson test
Next to the null hypothesis \(h\) that the lady is randomly guessing, we now consider the alternative hypothesis \(h'\) that she has a chance of \(3/4\) to guess the order of tea and milk correctly. The samples \(s\) are binary 5-tuples that record guesses as correct and incorrect. To determine the likelihoods of the two hypotheses, and thereby the value of the test function for each sample, we only need to know the so-called sufficient statistic, in this case the number of correct guesses \(n\) independently of the order. Denoting a particular sequence of guesses in which the lady has \(n\) correct guesses out of \(t\) with \(s_{n/t}\), we have \(P_{h}(s_{n/5}) = 1/2^{5}\) and \(P_{h'}(s_{n/5}) = 3^{n} / 4^{5}\), so that the likelihood ratio becomes \(3^{n} / 2^{5}\). If we require that the significance is lower than 5%, then it can be calculated that only the samples with \(n = 5\) may be included in the region of rejection. Accordingly we may set the cut-off point \(r\) such that \(r \geq 3^{4} / 2^{5}\) and \(r < 3^{5} / 2^{5}\), e.g., \(r = 3^{4} / 2^{5}\).
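These numbers are easy to verify; the following sketch recomputes the rejection region, the significance, and the power for the test just described:

```python
from math import comb

# Likelihood ratio from the example: P_h gives every 5-tuple probability
# 1/2^5, and P_h' gives a tuple with n correct guesses probability
# 3^n / 4^5, so the ratio is 3^n / 2^5.
def ratio(n):
    return 3**n / 2**5

r = 3**4 / 2**5   # cut-off point from the example
region = [n for n in range(6) if ratio(n) > r]
print(region)     # [5]: only five correct guesses lead us to reject h

# Significance: probability under h of landing in the rejection region.
alpha = sum(comb(5, n) * (1/2)**5 for n in region)
# Power: probability under h' of landing in the rejection region.
power = sum(comb(5, n) * (3/4)**n * (1/4)**(5 - n) for n in region)
print(alpha)      # 0.03125, below the 5% level
print(power)      # about 0.237
```

Note the modest power: even if the lady really has 3/4 accuracy, she passes this five-cup test less than a quarter of the time.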
The threshold of 5% significance is part of statistical convention, and it is very often fixed before even considering the power. Notice that the statistical procedure associates expected error rates with a decision to reject or accept. Neyman especially has become known for interpreting this in a strictly behaviorist fashion. For further discussion on this point, see Section 3.2.2.
In this section we briefly consider parameter estimation by maximum likelihood, as first devised by Fisher (1956). While in the foregoing we used a finite sample space, we now employ a space with infinitely many possible samples. Accordingly, a probability distribution over sample space is written down in terms of a so-called density function, denoted \(P(s) ds\), which technically speaking expresses the infinitely small probability assigned to an infinitely small patch \(ds\) around the point \(s\). This probability density works much like an ordinary probability function.
Maximum likelihood estimation, or MLE for short, is a tool for determining the best among a set of hypotheses, often called a statistical model. Let \(M = \{ h_{\theta} :\: \theta \in \Theta \}\) be the model, labeled by the parameter \(\theta\), let \(S\) be the sample space, and \(P_{\theta}\) the distribution associated with \(h_{\theta}\). Then define the maximum likelihood estimator \(\hat{\theta}\) as a function over the sample space: \[ \hat{\theta}(s) = \left\{ \theta :\: \forall h_{\theta'} \bigl(P_{\theta'}(s)ds \leq P_{\theta}(s)ds \bigr) \right\}. \] So the estimator is a set, typically a singleton, of values of \(\theta\) for which the likelihood of \(h_{\theta}\) on the data \(s\) is maximal. The associated best hypothesis we denote with \(h_{\hat{\theta}}\). This can again be illustrated for the tea tasting lady.
Maximum likelihood estimation
A natural statistical model for the case of the tea tasting lady consists of hypotheses \(h_{\theta}\) for all possible levels of accuracy that the lady may have, \(\theta \in [0, 1]\). Now the number of correct guesses \(n\) and the total number of guesses \(t\) are the sufficient statistics: the probability of a sample only depends on those numbers. For any particular sequence \(s_{n/t}\) of \(t\) guesses with \(n\) successes, the associated likelihoods of \(h_{\theta}\) are\[ P_{\theta}(s_{n/t}) = \theta^{n} (1 - \theta)^{t - n} . \]For any number of trials \(t\) the maximum likelihood estimator then becomes \(\hat{\theta} = n / t\).
We suppose that the number of cups served to the lady is fixed at \(t\) so that sample space is finite again. Notice, finally, that \(h_{\hat{\theta}}\) is the hypothesis that makes the data most probable and not the hypothesis that is most probable in the light of the data.
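A minimal numerical check of the estimator just derived: for \(n\) successes in \(t\) trials, the likelihood \(\theta^{n}(1-\theta)^{t-n}\) peaks at \(n/t\). The grid search below is purely illustrative; the values \(n = 4\), \(t = 5\) are our own choice.

```python
# Grid-search sketch: the likelihood of n successes in t trials is
# maximized at theta = n / t, as the text states.
def likelihood(theta, n, t):
    return theta**n * (1 - theta)**(t - n)

n, t = 4, 5
grid = [i / 1000 for i in range(1001)]
mle = max(grid, key=lambda th: likelihood(th, n, t))
print(mle)   # 0.8, i.e. n / t
```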
There are several requirements that we might impose on an estimator function. One is that the estimator must be consistent. This means that for larger samples the estimator function \(\hat{\theta}\) converges to the parameter values associated with the distribution \(\theta^{\star}\) of the data generating system, or the true parameter values for short. Another requirement is that the estimator must be unbiased, meaning that there is no discrepancy between the expected value of the estimator and the true parameter values. The MLE procedure is certainly not the only one used for estimating the value of a parameter of interest on the basis of statistical data. A simpler technique is the minimization of a particular target function, e.g., minimizing the sum of the squares of the distances between the prediction of the statistical hypothesis and the data points, also known as the method of least squares. A more general perspective, first developed by Wald (1950), is provided by measuring the discrepancy between the predictions of the hypothesis and the actual data in terms of a loss function. The summed squares and the likelihoods may be taken as expressions of this loss.
Often the estimation is coupled to a so-called confidence interval (cf. Cumming 2012). For ease of exposition, assume that \(\Theta\) consists of the real numbers and that every sample \(s\) is labelled with a unique \(\hat{\theta}(s)\). We define the set \(R_{\tau} = \{ s:\: \hat{\theta}(s) = \tau \}\), the set of samples for which the estimator function has the value \(\tau\). We can now collate a region in sample space within which the estimator function \(\hat{\theta}\) is not too far off the mark, i.e., not too far from the true value \(\theta^{\star}\) of the parameter. For example,\[ C^{\star}_{\Delta} = \{ R_{\tau} :\: \tau \in [ \theta^{\star} - \Delta , \theta^{\star} + \Delta ] \} . \]So this set is the union of all \(R_{\tau}\) for which \(\tau \in [ \theta^{\star} - \Delta , \theta^{\star} + \Delta ]\). Now we might set this region in such a way that it covers a large portion of the sample space, say \(1 - \alpha\), as measured by the true distribution \(P_{\theta^{\star}}\). We choose \(\Delta\) such that\[ P_{\theta^{\star}}(C^{\star}_{\Delta}) = \int_{\theta^{\star} - \Delta}^{\theta^{\star} + \Delta} P_{\theta^{\star}}(R_{\tau}) d\tau = 1 - \alpha .\]Statistical folklore typically sets \(\alpha\) at a value of 5%. Relative to this number, the size of \(\Delta\) says something about the quality of the estimate. If we were to repeat the collection of the sample over and over, we would find the estimator \(\hat{\theta}\) within a range \(\Delta\) of the true value \(\theta^{\star}\) in 95% of all samples. This leads us to define the symmetric 95% confidence interval:\[ CI_{95} = [ \hat{\theta} - \Delta , \hat{\theta} + \Delta ] \]The interpretation is the same as in the foregoing: with repeated sampling we find the true value within \(\Delta\) of the estimate in 95% of all samples.
It is crucial that we can provide an unproblematic frequentist interpretation of the event that \(\hat{\theta} \in [\theta^{\star} - \Delta, \theta^{\star} + \Delta]\), under the assumption of the true distribution. In a series of estimations, the fraction of times in which the estimator \(\hat{\theta}\) is further away from \(\theta^{\star}\) than \(\Delta\), and hence outside this interval, will tend to 5%. The smaller this region, the more reliable the estimate. Note that this interval is defined in terms of the unknown true value \(\theta^{\star}\). However, especially if the size of the interval \(2 \Delta\) is independent of the true parameter \(\theta^{\star}\), it is tempting to associate the 95% confidence interval with the frequency with which the true value lies within a range of \(\Delta\) around the estimate \(\hat{\theta}\). Below we come back to this interpretation.
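The repeated-sampling reading just described can be illustrated with a small simulation. The normal model, the true mean 2.0, the sample size 25, and the familiar half-width \(1.96/\sqrt{n}\) are illustrative assumptions, not part of the text.

```python
import random

# Simulation sketch of the frequentist reading of a 95% confidence
# interval: over many repeated samples, the estimator (the sample mean)
# lands within Delta of the true value in roughly 95% of cases.
random.seed(0)
theta_star, n, trials = 2.0, 25, 10_000
delta = 1.96 / n**0.5          # half-width for a unit-variance normal

hits = 0
for _ in range(trials):
    sample_mean = sum(random.gauss(theta_star, 1) for _ in range(n)) / n
    if abs(sample_mean - theta_star) <= delta:
        hits += 1

print(hits / trials)   # close to 0.95
```

Note that the simulation conditions on the true value \(\theta^{\star}\); it says nothing about the probability that a fixed, already-computed interval contains the true value, which is exactly the interpretive trap the text goes on to discuss.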
There are of course many more procedures for estimating a variety of statistical targets, and there are many more expressions for the quality of the estimation (e.g., bootstrapping, see Efron and Tibshirani 1993). Theories of estimation often come equipped with a rich catalogue of situation-specific criteria for estimators, reflecting the epistemic and pragmatic goals that the estimator helps to achieve. However, in themselves the estimator functions do not present guidelines for belief and, importantly, confidence intervals do not either.
Classical statistics is widely discussed in the philosophy of statistics. In what follows two problems with the classical approach are outlined, to wit, its problematic interface with belief and the fact that it violates the so-called likelihood principle. Many more specific problems can be seen to derive from these general ones.
Consider the likelihood ratio test of Neyman and Pearson. As indicated, the significance or p-value of a test is an error rate that will manifest if data collection and testing is repeated, assuming that the null hypothesis is in fact true. Notably, the p-value does not tell us anything about how probable the truth of the null hypothesis is. However, many scientists do use hypothesis testing in this manner, and there is much debate over what can and cannot be derived from a p-value (cf. Berger and Sellke 1987, Casella and Berger 1987, Cohen 1994, Harlow et al 1997, Wagenmakers 2007, Ziliak and McCloskey 2008, Spanos 2007, Greco 2011, Sprenger forthcoming-a). After all, the test leads to the advice to either reject the hypothesis or accept it, and this seems conceptually very close to giving a verdict of truth or falsity.
While the evidential value of p-values is much debated, many admit that the probability of data according to a hypothesis cannot be used straightforwardly as an indication of how believable the hypothesis is (cf. Gillies 1971, Spielman 1974 and 1978). Such usage runs into the so-called base-rate fallacy. The example of the tea tasting lady is again instructive.
Base-rate fallacy
Imagine that we travel the country to perform the tea tasting test with a large number of ladies, and that we find a particular lady who guesses all five cups correctly. Should we conclude that the lady has a special talent for tasting tea? The problem is that this depends on how many ladies among those tested actually have the special talent. If the ability is very rare, it is more attractive to put the five correct guesses down to a chance occurrence. By comparison, imagine that all the ladies enter the lottery. In analogy to a lady guessing all cups correctly, consider a lady who wins one of the lottery's prizes. Of course winning a prize is very improbable, unless one is in cahoots with the bookmaker, i.e., the analogue of having a special tea tasting ability. But surely if a lady wins the lottery, this is not a good reason to conclude that she must have committed fraud and call for her arrest. Similarly, if a lady has guessed all cups correctly, we cannot simply conclude that she has special abilities.
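The base-rate point can be made numerically with Bayes' theorem. The base rate of 1 in 1000 talented ladies, and the assumption that a talented lady guesses with the accuracy \(3/4\) from the earlier example, are illustrative choices of ours, not part of the text.

```python
# Hedged numerical illustration of the base-rate fallacy: five correct
# guesses are strong evidence, yet with a rare talent the posterior
# probability of talent stays small.
base_rate = 0.001            # assumed fraction of ladies with the talent
p_data_talent = 0.75**5      # five correct guesses at accuracy 3/4
p_data_chance = 0.5**5       # five correct guesses by random guessing

posterior = (base_rate * p_data_talent) / (
    base_rate * p_data_talent + (1 - base_rate) * p_data_chance
)
print(round(posterior, 4))   # 0.0075 — the talent hypothesis stays improbable
```

Despite the low p-value of the observed result, the hypothesis of a special ability remains far less probable than chance, because so few ladies have the ability to begin with.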
Essentially the same problem occurs if we consider the estimations of a parameter as direct advice on what to believe, as made clear by an example of Good (1983, p. 57) that is presented here in the tea tasting context. After observing five correct guesses, we have \(\hat{\theta} = 1\) as maximum likelihood estimator. But it is hardly believable that the lady will in the long run be 100% accurate. The point that estimation and belief maintain complicated relations is also put forward in discussions of Lindley's paradox (Lindley 1957, Spanos 2013, Sprenger forthcoming-b). In short, it seems wrongheaded to turn the results of classical statistical procedures into beliefs.
It is a matter of debate whether any of this can be blamed on classical statistics. Initially, Neyman was emphatic that his procedures could not be taken as inferences, or as in some other way pertaining to the epistemic status of the hypotheses. His own statistical philosophy was strictly behaviourist (cf. Neyman 1957), and it may be argued that the problems disappear if only scientists abandon their faulty epistemic use of classical statistics. As explained in the foregoing, we can uncontroversially associate error rates with classical procedures, and so with the decisions that flow from these procedures. Hence, a behavioural and error-based understanding of classical statistics seems just fine. However, both statisticians and philosophers have argued that an epistemic reading of classical statistics is possible, and in fact preferable (e.g., Fisher 1955, Royall 1997). Accordingly, many have attempted to reinterpret or develop the theory, in order to align it with the epistemically oriented statistical practice of scientists (see Mayo 1996, Mayo and Spanos 2011, Spanos 2013b).
Hypothesis tests and estimations are sometimes criticised because their results generally depend on the probability functions over the entire sample space, and not exclusively on the probabilities of the observed sample. That is, the decision to accept or reject the null hypothesis depends not just on the probability of what has actually been observed according to the various hypotheses, but also on the probability assignments over events that could have been observed but were not. A well-known illustration of this problem concerns so-called optional stopping (Robbins 1952, Roberts 1967, Kadane et al 1996, Mayo 1996, Howson and Urbach 2006).
Optional stopping is here illustrated for the likelihood ratio test of Neyman and Pearson, but a similar story can be told for Fisher's null hypothesis test and for the determination of estimators and confidence intervals.
Optional stopping
Imagine two researchers who are both testing the same lady on her ability to determine the order in which milk and tea were poured in her cup. They both entertain the null hypothesis that she is guessing at random, with a probability of \(1/2\), against the alternative of her guessing correctly with a probability of \(3/4\). The more diligent researcher of the two decides to record six trials. The more impatient researcher, on the other hand, records at most six trials, but decides to stop recording after the first trial in which the lady guesses incorrectly. Now imagine that, in actual fact, the lady guesses all but the last of the cups correctly. Both researchers then have the exact same data of five successes and one failure, and the likelihoods for these data are the same for the two researchers too. However, while the diligent researcher cannot reject the null hypothesis, the impatient researcher can.
This might strike us as peculiar: statistics should tell us the objective impact that the data have on a hypothesis, but here the impact seems to depend on the sampling plan of the researcher and not just on the data themselves. As further explained in Section 3.2.3, the results of the two researchers differ because of differences in how samples that were not observed are factored into the procedure.
Some will find this dependence unacceptable: the intentions and plans of the researcher are irrelevant to the evidential value of the data. But others argue that it is just right. They maintain that the impact of data on the hypotheses should depend on the stopping rule or protocol that is followed in obtaining them, and not only on the likelihoods that the hypotheses have for those data (e.g. Mayo 1996). The motivating intuition is that upholding the irrelevance of the stopping rule makes it impossible to ban opportunistic choices in data collection. In fact, defenders of classical statistics turn the tables on those who maintain that optional stopping is irrelevant. They submit that it opens up the possibility of reasoning to a foregone conclusion by, for example, persistent experimentation: we might decide to cease experimentation only if the preferred result is reached. However, as shown in Kadane et al. (1996) and further discussed in Steele (2012), persistent experimentation is not guaranteed to be effective, as long as we make sure to use the correct, in this case Bayesian, procedures.
The debate over optional stopping is eventually concerned with the appropriate evidential impact of data. A central concern in this wider debate is the so-called likelihood principle (see Hacking 1965 and Edwards 1972). This principle has it that the likelihoods of hypotheses for the observed data completely fix the evidential impact of those data on the hypotheses. In the formulation of Berger and Wolpert (1984), the likelihood principle states that two samples \(s\) and \(s'\) are evidentially equivalent exactly when \(P_{i}(s) = kP_{i}(s')\) for all hypotheses \(h_{i}\) under consideration, given some constant \(k\). Famously, Birnbaum (1962) offers a proof of the principle from more basic assumptions. This proof relies on the assumption of conditionality. Say that we first toss a coin, find that it lands heads, then do the experiment associated with this outcome, to record the sample \(s\). Compare this to the case where we do the experiment and find \(s\) directly, without randomly picking it. The conditionality principle states that this second sample has the same evidential impact as the first one: what we could have found, but did not find, has no impact on the evidential value of the sample. Recently, Mayo (2010) has taken issue with Birnbaum's derivation of the likelihood principle.
The classical view sketched above entails a violation of this principle: the impact of the observed data may be different depending on the probability of other samples than the observed one, because those other samples come into play when determining regions of acceptance and rejection. The Bayesian procedures discussed in Section 4, on the other hand, uphold the likelihood principle: in determining the posterior distribution over hypotheses only the prior and the likelihood of the observed data matter. In the debate over optional stopping and in many of the other debates between classical and Bayesian statistics, the likelihood principle is the focal point.
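The proportionality condition of the likelihood principle can be checked directly for the two sampling plans of the researchers above: a fixed run of six trials versus stopping at the first failure, both yielding five successes and one failure.

```python
from math import comb

# Check of the likelihood principle's proportionality condition: the two
# sampling plans assign proportional likelihoods to every hypothesis p.
def fixed_trials_lik(p):
    # Six trials, sufficient statistic "five successes": comb(6, 5) sequences.
    return comb(6, 5) * p**5 * (1 - p)

def stop_at_failure_lik(p):
    # Stop at the first failure: only the sequence SSSSSF is possible.
    return p**5 * (1 - p)

ratios = [fixed_trials_lik(p) / stop_at_failure_lik(p) for p in (0.3, 0.5, 0.75)]
print(ratios)   # the same constant k = 6 for every hypothesis
```

Since the ratio is the same constant for all hypotheses, the likelihood principle declares the two data sets evidentially equivalent, which is exactly what the classical rejection regions deny.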
The view that the data reveal more, or something else, than what is expressed by the likelihoods of the hypotheses at issue merits detailed attention. Here we investigate this issue further with reference to the controversy over optional stopping.
Let us consider the analyses of the two above researchers in some numerical detail by constructing the regions of rejection for both of them.
Determining regions of rejection
The diligent researcher considers all 6-tuples of success and failure as the sample space, and takes their numbers as sufficient statistic. The event of six successes, or six correct guesses, has a probability of \(1 / 2^{6} = 1/64\) under the null hypothesis that the lady is merely guessing, against a probability of \(3^{6} / 4^{6}\) under the alternative hypothesis. If we set \(r < 3^{6} / 2^{6}\), then this sample is included in the region of rejection of the null hypothesis. Samples with five successes have a probability of \(1/64\) under the null hypothesis too, against a probability of \(3^5 / 4^{6}\) under the alternative. By lowering the cut-off \(r\) by a factor of 3, we include all these samples in the region of rejection. But this will lead to a total probability of false rejection of \(7/64\), which is larger than 5%. So these samples cannot be included in the region of rejection, and hence the diligent researcher does not reject the null hypothesis upon finding five successes and one failure. For the impatient researcher, on the other hand, the sample space is much smaller. Apart from the sample consisting of six successes, all samples consist of a series of successes ending with a failure, differing only in the length of the series. Yet the probabilities over the two samples of length six are the same as for the diligent researcher. As before, the sample of six successes is again included in the region of rejection. Similarly, the sequence of five successes followed by one failure also has a probability of \(1/64\) under the null hypothesis, against a probability of \(3^5 / 4^{6}\) according to the alternative. The difference is that lowering the cut-off to include this sample in the region of rejection leads to the inclusion of this sample only. And if we include it in the region of rejection, the probability of false rejection becomes \(1/32\) and hence does not exceed 5%.
Consequently, on the basis of these data the impatient researcher can reject the null hypothesis that the lady is merely guessing.
It is instructive to consider why exactly the impatient researcher can reject the null hypothesis. In virtue of his sampling plan, the other samples with five successes, namely the ones which kept the diligent researcher from including the observed sample in the region of rejection on pain of exceeding the error probability, could not have been observed. This exemplifies that the results of a classical statistical procedure do not only depend on the likelihoods for the actual data, which are indeed the same for both researchers. They also depend on the likelihoods for data that we did not obtain.
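The error probabilities behind the two verdicts can be recapped in a few lines, using only the numbers from the example above.

```python
# Numerical recap of why the two researchers reach different verdicts
# on the same data (five successes, one failure), under the null p = 1/2.
p0 = 0.5

# Diligent researcher: all 6-tuples. Rejecting on six successes alone:
alpha_six = p0**6                      # 1/64
# ...but adding the six samples with exactly five successes:
alpha_with_five = p0**6 + 6 * p0**6    # 7/64 > 0.05, so they stay out

# Impatient researcher: samples are S...SF sequences or SSSSSS.
# Rejecting on SSSSSS and SSSSSF together:
alpha_impatient = p0**6 + p0**6        # 1/32 < 0.05

print(alpha_six, alpha_with_five, alpha_impatient)
# 0.015625 0.109375 0.03125
```

The unobserved five-success samples inflate the diligent researcher's error rate past 5%, while for the impatient researcher those samples simply do not exist in the sample space.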
In the above example, it may be considered confusing that the protocol used for optional stopping depends on the data that is being recorded. But the controversy over optional stopping also emerges if this dependence is absent. For example, imagine a third researcher who samples until the diligent researcher is done, or before that if she starts to feel peckish. Furthermore we may suppose that with each new cup offered to the lady, the probability of feeling peckish is \(\frac{1}{2}\). This peckish researcher will also be able to reject the null hypothesis if she completes the series of six cups. And it certainly seems at variance with the objectivity of the statistical procedure that this rejection depends on the physiology and the state of mind of the researcher: if she had not kept open the possibility of a snack break, she would not have rejected the null hypothesis, even though she did not actually take that break. As Jeffreys famously quipped, this is indeed a “remarkable procedure”.
Yet the case is not as clear-cut as it may seem. For one, the peckish researcher is arguably testing two hypotheses in tandem, one about the ability of the tea tasting lady and another about her own peckishness. Together the combined hypotheses have a different likelihood for the actual sample than the simple hypothesis considered by the diligent researcher. The likelihood principle given above dictates that this difference does not affect the evidential impact of the actual sample, but some retain the intuition that it should. Moreover, in some cases this intuition is shared by those who uphold the likelihood principle, namely when the stopping rule depends on the process being recorded in a way not already expressed by the hypotheses at issue (cf. Robbins 1952, Howson and Urbach 2006, p. 365). In terms of our example, if the lady is merely guessing, then it may be more probable that the researcher gets peckish out of sheer boredom, than if the lady performs far below or above chance level. In such a case the act of stopping itself reveals something about the hypotheses at issue, and this should be reflected in the likelihoods of the hypotheses. This would make the evidential impact that the data have on the hypothesis dependent on the stopping rule after all.
There have been numerous responses to the above criticisms. Some of those responses effectively reinterpret the classical statistical procedures as pertaining only to the evidential impact of data. Other responses develop the classical statistical theory to accommodate the problems. Their common core is that they establish or at least clarify the connection between two conceptual realms: the statistical procedures refer to physical probabilities, while their results pertain to evidence and support, and even to the rejection or acceptance of hypotheses.
Classical statistics is often presented as providing us with advice for actions. The error probabilities do not tell us what epistemic attitude to take on the basis of statistical procedures; rather they indicate the long-run frequency of error if we live by them. Neyman specifically advocated this interpretation of classical procedures. Against this, Fisher (1935a, 1955), Pearson, and other classical statisticians have argued for more epistemic interpretations, and many more recent authors have followed suit.
Central to the above discussion on classical statistics is the concept of likelihood, which reflects how the data bear on the hypotheses at issue. In the works of Hacking (1965), Edwards (1972), and more recently Royall (1997), the likelihoods are taken as a cornerstone for statistical procedures and given an epistemic interpretation. They are said to express the strength of the evidence presented by the data, or the comparative degree of support that the data give to a hypothesis. Hacking formulates this idea in the so-called law of likelihood (1965, p. 59): if the sample \(s\) is more probable on the condition of \(h_{0}\) than on \(h_{1}\), then \(s\) supports \(h_{0}\) more than it supports \(h_{1}\).
The position of likelihoodism is based on a specific combination of views on probability. On the one hand, it only employs probabilities over sample space, and avoids putting probabilities over statistical hypotheses. It thereby avoids the use of probability that cannot be given a physical interpretation. On the other hand, it does interpret the probabilities over sample space as components of a support relation, and thereby as pertaining to the epistemic rather than the physical realm. Notably, the likelihoodist approach fits well with a long history in formal approaches to epistemology, in particular with confirmation theory (see Fitelson 2007), in which probability theory is used to spell out confirmation relations between data and hypotheses. Measures of confirmation invariably take the likelihoods of hypotheses as input components. They provide a quantitative expression of the support relations described by the law of likelihood.
Another epistemic approach to classical statistics is presented by Mayo (1996) and Mayo and Spanos (2011). Over the past decade or so, they have done much to push the agenda of classical statistics in the philosophy of science, which had become dominated by Bayesian statistics. Countering the original behaviourist tendencies of Neyman, the error statistical approach advances an epistemic reading of classical test and estimation procedures. Mayo and Spanos argue that classical procedures are best understood as inferential: they license inductive inferences. But they readily admit that the inferences are defeasible, i.e., they could lead us astray. Classical procedures are always associated with particular error probabilities, e.g., the probability of a false rejection or acceptance, or the probability of an estimator falling within a certain range. In the theory of Mayo and Spanos, these error probabilities obtain an epistemic role, because they are taken to indicate the reliability of the inferences licensed by the procedures.
The error statistical approach of Mayo and others comprises a general philosophy of science as well as a particular viewpoint on the philosophy of statistics. We briefly focus on the latter, through a discussion of the notion of a severe test (cf. Mayo and Spanos 2006). The claim is that we gain knowledge of experimental effects on the basis of severely testing hypotheses, which can be characterized by the significance and power. In Mayo's definition, a hypothesis passes a severe test on two conditions: the data must agree with the hypothesis, and the probability must be very low that the data agree with the alternative hypothesis. Ignoring potential controversy over the precise interpretation of “agree” and “low probability”, we can recognize the criteria of Neyman and Pearson in these requirements. The test is severe if the significance is low, since the data must agree with the hypothesis, and the power is high, since those data must not agree, or else have a low probability of agreeing, with the alternative.
Apart from re-interpretations of the classical statistical procedures, numerous statisticians and philosophers have developed the theory of classical statistics further in order to make good on the epistemic role of its results. We focus on two developments in particular, to wit, fiducial and evidential probability.
The theory of evidential probability originates in Kyburg (1961), who developed a logical system to deal consistently with the results of classical statistical analyses. Evidential probability thus falls within the attempts to establish the epistemic use of classical statistics. Haenni et al (2010) and Kyburg and Teng (2001) present an insightful introduction to evidential probability. The system is based on a version of default reasoning: statistical hypotheses come attached with a confidence level, and the logical system organizes how such confidence levels are propagated in inference, and thus advises which hypothesis to use for predictions and decisions. Particular attention is devoted to the propagation of confidence levels in inferences that involve multiple instances of the same hypothesis tagged with different confidences, where those confidences result from diverse data sets that are each associated with a particular population. Evidential probability assists in selecting the optimal confidence level, and thus in choosing the appropriate population for the case under consideration. In other words, evidential probability helps to resolve the reference class problem alluded to in the foregoing.
Fiducial probability presents another way in which classical statistics can be given an epistemic status. Fisher (1930, 1933, 1935c, 1956/1973) developed the notion of fiducial probability as a way of deriving a probability assignment over hypotheses without assuming a prior probability over statistical hypotheses at the outset. The fiducial argument is controversial, and it is generally agreed that its applicability is limited to particular statistical problems. Dempster (1964), Hacking (1965), Edwards (1972), Seidenfeld (1996) and Zabell (1996) provide insightful discussions. Seidenfeld (1979) presents a particularly detailed study and a further discussion of the restricted applicability of the argument in cases with multiple parameters. Dawid and Stone (1982) argue that in order to run the fiducial argument, one has to assume that the statistical problem can be captured in a functional model that is smoothly invertible. Dempster (1966) provides generalizations of this idea for cases in which the distribution over \(\theta\) is not fixed uniquely but only constrained within upper and lower bounds (cf. Haenni et al 2011). Crucially, such constraints on the probability distribution over values of \(\theta\) are obtained without assuming any distribution over \(\theta\) at the outset.
To explain the fiducial argument we first set up a simple example. Say that we estimate the mean \(\theta\) of a normal distribution with unit variance over a variable \(X\). We collect a sample \(s\) consisting of measurements \(X_{1}, X_{2}, \ldots, X_{n}\). The maximum likelihood estimator for \(\theta\) is the average value of the \(X_{i}\), that is, \(\hat{\theta}(s) = \sum_{i} X_{i} / n\). Under an assumed true value \(\theta\) we then have a normal distribution for the estimator \(\hat{\theta}(s)\), centred on the true value and with standard deviation \(1 / \sqrt{n}\). Notably, this distribution has the same shape for all values of \(\theta\). Because of this, argued Fisher, we can use the distribution over the estimator \(\hat{\theta}(s)\) as a stand-in for the distribution over the true value \(\theta\). We thus derive a probability distribution \(P(\theta)\) on the basis of a sample \(s\), seemingly without assuming a prior probability.
There are several ways to clarify this so-called fiducial argument. One way employs a so-called functional model, i.e., the specification of a statistical model by means of a particular function. For the above model, the function is\[ f(\theta, \epsilon) = \theta + \epsilon = \hat{\theta}(s) . \]It relates possible parameter values \(\theta\) to a quantity based on the sample, in this case the estimator of the observations \(\hat{\theta}\). The two are related through a stochastic component \(\epsilon\) whose distribution is known, and the same for all the samples under consideration. In our case \(\epsilon\) is distributed normally with standard deviation \(1 / \sqrt{n}\). Importantly, the distribution of \(\epsilon\) is the same for every value of \(\theta\). The interpretation of the function \(f\) may now be apparent. Relative to the choice of a value of \(\theta\), which then obtains the role of the true value \(\theta^{\star}\), the distribution over \(\epsilon\) dictates the distribution over the estimator function \(\hat{\theta}(s)\).
The idea of the fiducial argument can now be expressed succinctly. It is to project the distribution over the stochastic component back onto the possible parameter values. The key observation is that the functional relation \(f(\theta, \epsilon)\) is smoothly invertible, i.e., the function\[ f^{-1}(\hat{\theta}(s), \epsilon) = \hat{\theta}(s) - \epsilon = \theta \]maps each combination of \(\hat{\theta}(s)\) and \(\epsilon\) to a unique parameter value \(\theta\). Hence, we can invert the claim of the previous paragraph: relative to fixing a value for \(\hat{\theta}\), the distribution over \(\epsilon\) fully determines the distribution over \(\theta\). Hence, in virtue of the inverted functional model, we can transfer the normal distribution over \(\epsilon\) to the values \(\theta\) around \(\hat{\theta}(s)\). This yields a so-called fiducial probability distribution over the parameter \(\theta\). The distribution is obtained because, conditional on the value of the estimator, the parameters and the stochastic terms become perfectly correlated. A distribution over the latter is then automatically applicable to the former (cf. Haenni et al, 52–55 and 119–122).
Another way of explaining the same idea invokes the notion of a pivotal quantity. Because of how the above statistical model is set up, we can construct the pivotal quantity \(\hat{\theta}(s) - \theta\). We know the distribution of this quantity, namely normal and with the aforementioned variance. Moreover, this distribution is independent of the sample, and it is such that fixing the sample to \(s\), and so fixing the value of \(\hat{\theta}\), uniquely determines a distribution over the parameter values \(\theta\). The fiducial argument thus allows us to construct a probability distribution over the parameter values on the basis of the observed sample. The argument can be run whenever we can construct a pivotal quantity like that or, equivalently, whenever we can express the statistical model as a functional model.
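The pivotal property can be checked by simulation: for the normal location model of the example, the distribution of \(\hat{\theta}(s) - \theta\) does not depend on \(\theta\). The sample size \(n = 20\) and the two trial values of \(\theta\) below are illustrative assumptions of ours.

```python
import random

# Simulation sketch of the pivotal quantity mean(s) - theta for a normal
# model with unit variance: its distribution (mean 0, variance 1/n) is the
# same whatever value theta takes.
random.seed(1)
n, trials = 20, 20_000

def pivot_samples(theta):
    out = []
    for _ in range(trials):
        est = sum(random.gauss(theta, 1) for _ in range(n)) / n
        out.append(est - theta)
    return out

for theta in (0.0, 5.0):
    piv = pivot_samples(theta)
    mean = sum(piv) / trials
    var = sum(x * x for x in piv) / trials - mean**2
    print(round(mean, 3), round(var, 3))   # roughly 0 and 1/n = 0.05 each time
```

Because the simulated distribution is the same for both values of \(\theta\), fixing the estimator and reading the distribution "backwards" onto \(\theta\), as the fiducial argument does, at least yields a well-defined distribution in this model.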
A warning is in order here. As revealed in many of the above references, the fiducial argument is highly controversial. The mathematical results are there, but the proper interpretation of the results is still up for discussion. In order to properly appreciate the precise inferential move and its wobbly conceptual basis, it will be instructive to consider the use of fiducial probability in interpreting confidence intervals. A proper understanding of this requires first reading Section 3.1.2.
Recall that confidence intervals, which are standardly taken to indicate the quality of an estimation, are often interpreted epistemically. The 95% confidence interval is often misunderstood as the range of parameter values that includes the true value with 95% probability, a so-called credal interval:\[ P(\theta \in [\hat{\theta} - \Delta, \hat{\theta} + \Delta]) = 0.95. \]This interpretation is at odds with classical statistics but, as will become apparent, it can be motivated by an application of the fiducial argument. Say that we replace the integral determining the size \(\Delta\) of the confidence interval by the following:\[ \int_{\hat{\theta}(s) - \Delta}^{\hat{\theta}(s) + \Delta} P_{\theta}(R_{\hat{\theta}(s)}) d\theta = 0.95 .\]In words, we fix the estimator \(\hat{\theta}(s)\) and then integrate over the parameters \(\theta\) in \(P_{\theta}(R_{\hat{\theta}(s)})\), rather than assuming \(\theta^{\star}\) and then integrating over the parameters \(\tau\) in \(R_{\tau}\). Sure enough we can calculate this integral. But what ensures that we can treat the integral as a probability? Notice that it runs over a continuum of probability distributions and that, as it stands, there is no reason to think that the terms \(P_{\theta}(R_{\hat{\theta}(s)})\) add up to a proper distribution in \(\theta\).
The assumptions of the fiducial argument, here explained in terms of the invertibility of the functional model, ensure that the terms indeed add up, and that a well-behaved distribution will surface. We can choose the statistical model in such a way that the sample statistic \(\hat{\theta}(s)\) and the parameter \(\theta\) are related in the right way: relative to the parameter \(\theta\), we have a distribution over the statistic \(\hat{\theta}\), but by the same token we have a distribution over parameters relative to this statistic. As a result, the probability function \(P_{\theta}(R_{\hat{\theta}(s) + \epsilon})\) over \(\epsilon\), where \(\theta\) is fixed, can be transferred to a fiducial probability function \(P_{\theta + \epsilon}(R_{\hat{\theta}(s)})\) over \(\epsilon\), where \(\hat{\theta}(s)\) is fixed. The function \(P_{\theta}(R_{\hat{\theta}})\) of the parameter \(\theta\) is thus a proper probability function, from which a credal interval can be constructed.
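The fiducial transfer can be checked numerically in a simple case. The sketch below uses a hypothetical normal location model, \(\hat{\theta}(s) = \theta + \epsilon\) with \(\epsilon \sim N(0, \sigma^{2})\) (an assumption for illustration, not an example from the text): the fiducial density over \(\theta\), given \(\hat{\theta}(s)\), is then \(N(\hat{\theta}(s), \sigma^{2})\), and integrating it over \([\hat{\theta}(s) - \Delta, \hat{\theta}(s) + \Delta]\) with \(\Delta = 1.96\sigma\) recovers roughly 0.95.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fiducial_mass(theta_hat, sigma, delta, steps=100_000):
    """Integrate the fiducial density of theta, N(theta_hat, sigma^2),
    over [theta_hat - delta, theta_hat + delta] by the midpoint rule."""
    lo, hi = theta_hat - delta, theta_hat + delta
    h = (hi - lo) / steps
    return sum(normal_pdf(lo + (i + 0.5) * h, theta_hat, sigma)
               for i in range(steps)) * h

sigma = 1.0
theta_hat = 2.3        # hypothetical observed estimate
delta = 1.96 * sigma   # half-width of the 95% interval
mass = fiducial_mass(theta_hat, sigma, delta)
print(round(mass, 3))  # ≈ 0.95
```

By translation invariance of the normal density, the mass does not depend on the observed value \(\hat{\theta}(s)\), only on \(\Delta / \sigma\).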
Even then, it is not clear why we should take this distribution as an appropriate expression of our belief, so that we may support the epistemic interpretation of confidence intervals with it. And so the debate continues. In the end fiducial probability is perhaps best understood as a half-way house between the classical and the Bayesian view on statistics. Classical statistics grew out of a frequentist interpretation of probability, and accordingly the probabilities appearing in the classical statistical methods are all interpreted as frequencies of events. Clearly, the probability distribution over hypotheses that is generated by a fiducial argument cannot be interpreted in this way, so that an epistemic interpretation of this distribution seems the only option. Several authors (e.g., Dempster 1964) have noted that fiducial probability indeed makes most sense from a Bayesian perspective. It is to this perspective that we now turn.
Bayesian statistical methods are often presented in the form of an inference. The inference runs from a so-called prior probability distribution over statistical hypotheses, which expresses the degree of belief in the hypotheses before the data have been collected, to a posterior probability distribution over the hypotheses, which expresses the beliefs after the data have been incorporated. The posterior distribution follows, via the axioms of probability theory, from the prior distribution and the likelihoods of the hypotheses for the data obtained, i.e., the probability that the hypotheses assign to the data. Bayesian methods thus employ data to modulate our attitude towards a designated set of statistical hypotheses, and in this respect they achieve the same as classical statistical procedures. Both types of statistics present a response to the problem of induction. But whereas classical procedures select or eliminate elements from the set of hypotheses, Bayesian methods express the impact of data in a posterior probability assignment over the set. This posterior is fully determined by the prior and the likelihoods of the hypotheses, via the formalism of probability theory.
The defining characteristic of Bayesian statistics is that it considers probability distributions over statistical hypotheses as well as over data. It embraces the epistemic interpretation of probability whole-heartedly: probabilities over hypotheses are interpreted as degrees of belief, i.e., as expressions of epistemic uncertainty. The philosophy of Bayesian statistics is concerned with determining the appropriate interpretation of these input components, and of the mathematical formalism of probability itself, ultimately with the aim of justifying the output. Notice that the general pattern of a Bayesian statistical method is that of inductivism in the cumulative sense: under the impact of data we move to more and more informed probabilistic opinions about the hypotheses. However, in what follows it will become apparent that Bayesian methods may also be understood as deductivist in nature.
Bayesian inference always starts from a statistical model, i.e., a set of statistical hypotheses. While the general pattern of inference is the same, we treat models with a finite number and a continuum of hypotheses separately, and draw parallels with hypothesis testing and estimation, respectively. The exposition is mostly based on Press 2002, Howson and Urbach 2006, Gelman et al. 2013, and Earman 1992.
Central to Bayesian methods is a theorem from probability theory known as Bayes' theorem. Relative to a prior probability distribution over hypotheses, and the probability distributions over sample space for each hypothesis, it tells us what the adequate posterior probability over hypotheses is. More precisely, let \(s\) be the sample and \(S\) be the sample space as before, and let \(M = \{h_{\theta} :\: \theta \in \Theta \}\) be the space of statistical hypotheses, with \(\Theta\) the space of parameter values. The function \(P\) is a probability distribution over the entire space \(M \times S\), meaning that every element \(h_{\theta}\) is associated with its own sample space \(S\), and its own probability distribution over that space. For the latter, which is fully determined by the likelihoods of the hypotheses, we write the probability of the sample conditional on the hypothesis, \(P(s \mid h_{\theta})\). This differs from the expression \(P_{h_{\theta}}(s)\), written in the context of classical statistics, because in contrast to classical statisticians, Bayesians accept \(h_{\theta}\) as an argument for the probability distribution.
Bayesian statistics is first introduced in the context of a finite set of hypotheses, after which a generalization to the infinite case is provided. Assume the prior probability \(P(h_{\theta})\) over the hypotheses \(h_{\theta} \in M\). Further assume the likelihoods \(P(s \mid h_{\theta})\), i.e., the probability assigned to the data \(s\) conditional on the hypotheses \(h_{\theta}\). Then Bayes' theorem determines that\[ P(h_{\theta} \mid s) \; = \; \frac{P(s \mid h_{\theta})}{P(s)} P(h_{\theta}) . \]Bayesian statistics outputs the posterior probability assignment, \(P(h_{\theta} \mid s)\). This expression gets the interpretation of an opinion concerning \(h_{\theta}\) after the sample \(s\) has been recorded and accommodated, i.e., it is a revised opinion. Further results from a Bayesian inference can all be derived from the posterior distribution over the statistical hypotheses. For instance, we can use the posterior to determine the most probable value for the parameter, i.e., picking the hypothesis \(h_{\theta}\) for which \(P(h_{\theta} \mid s)\) is maximal.
In this characterization of Bayesian statistical inference the probability of the data \(P(s)\) is not presupposed, because it can be computed from the prior and the likelihoods by the law of total probability,\[ P(s) \; = \; \sum_{\theta \in \Theta} P(h_{\theta}) P(s \mid h_{\theta}) . \]The result of a Bayesian statistical inference is not always reported as a posterior probability. Often the interest is only in comparing the ratio of the posteriors of two hypotheses. By Bayes' theorem we have\[ \frac{P(h_{\theta} \mid s)}{P(h_{\theta'} \mid s)} \; = \; \frac{P(h_{\theta}) P(s \mid h_{\theta})}{P(h_{\theta'}) P(s \mid h_{\theta'})} , \]and if we assume equal priors \(P(h_{\theta}) = P(h_{\theta'})\), we can use the ratio of the likelihoods of the hypotheses, the so-called Bayes factor, to compare the hypotheses.
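To make the formulas concrete, here is a minimal sketch of a posterior computation over a finite model, with \(P(s)\) obtained by the law of total probability. The two hypotheses and their numbers are illustrative, not drawn from the text.

```python
# Posterior over a finite model M = {h_theta} via Bayes' theorem:
# P(h|s) = P(s|h) P(h) / P(s), with P(s) from the law of total probability.
def posterior(prior, likelihood):
    """prior: dict theta -> P(h_theta); likelihood: dict theta -> P(s | h_theta)."""
    p_s = sum(prior[t] * likelihood[t] for t in prior)  # law of total probability
    return {t: prior[t] * likelihood[t] / p_s for t in prior}

# Illustrative numbers: two hypotheses, equal priors.
prior = {"h1": 0.5, "h2": 0.5}
likelihood = {"h1": 0.2, "h2": 0.6}
post = posterior(prior, likelihood)
print(post)  # roughly {'h1': 0.25, 'h2': 0.75}
```

With equal priors, the posterior ratio \(0.75 / 0.25 = 3\) equals the likelihood ratio \(0.6 / 0.2\), the Bayes factor.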
Here is a Bayesian procedure for the example of the tea tasting lady.
Bayesian statistical analysis
Consider the hypotheses \(h_{1/2}\) and \(h_{3/4}\), which in the foregoing were used as null and alternative, \(h\) and \(h'\), respectively. Instead of choosing among them on the basis of the data, we assign a prior distribution over them so that the null is twice as probable as the alternative: \(P(h_{1/2}) = 2/3\) and \(P(h_{3/4}) = 1/3\). Denoting a particular sequence of guessing \(n\) out of 5 cups correctly with \(s_{n/5}\), we have that \(P(s_{n/5} \mid h_{1/2}) = 1 / 2^{5}\) while \(P(s_{n/5} \mid h_{3/4}) = 3^{n} / 4^{5}\). As before, the likelihood ratio of the five guesses thus becomes\[ \frac{P(s_{n/5} \mid h_{3/4})}{P(s_{n/5} \mid h_{1/2})} \; = \; \frac{3^{n}}{2^{5}} . \]The posterior ratio after 5 correct guesses is thus\[ \frac{P(h_{3/4} \mid s_{5/5})}{P(h_{1/2} \mid s_{5/5})} \; = \; \frac{3^{5}}{2^{5}}\, \frac{1}{2} \approx 4 . \]This posterior is derived by the axioms of probability theory alone, in particular by Bayes' theorem. It tells us how believable each of the hypotheses is after incorporating the sample data into our beliefs.
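The arithmetic of this example can be reproduced exactly with rational numbers. The sketch below follows the numbers in the text: prior \(2/3\) versus \(1/3\), and five correct guesses.

```python
from fractions import Fraction

# Priors from the example: the null is twice as probable as the alternative.
prior = {"h_1/2": Fraction(2, 3), "h_3/4": Fraction(1, 3)}

def seq_likelihood(theta, n, t=5):
    """Probability of one particular sequence with n of t cups guessed correctly."""
    return theta**n * (1 - theta)**(t - n)

n = 5  # all five guesses correct
lik = {"h_1/2": seq_likelihood(Fraction(1, 2), n),
       "h_3/4": seq_likelihood(Fraction(3, 4), n)}

bayes_factor = lik["h_3/4"] / lik["h_1/2"]                     # 3^5 / 2^5
posterior_ratio = bayes_factor * prior["h_3/4"] / prior["h_1/2"]
print(bayes_factor, posterior_ratio, float(posterior_ratio))
# 243/32 243/64 3.796875
```

The posterior ratio \(243/64 \approx 3.8\) matches the text's "\(\approx 4\)": the alternative, initially half as probable as the null, ends up almost four times as probable.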
Notice that in the above exposition, the posterior probability is written as \(P(h_{\theta} \mid s_{n/5})\). Some expositions of Bayesian inference prefer to express the revised opinion as a new probability function \(P'( \cdot )\), which is then equated to the old \(P( \cdot \mid s)\). For the basic formal workings of Bayesian inference, this distinction is inessential. But we will return to it in Section 4.3.3.
In many applications the model is not a finite set of hypotheses, but rather a continuum labelled by a real-valued parameter. This leads to some subtle changes in the definition of the distribution over hypotheses and the likelihoods. The prior and posterior must be written down as a so-called probability density function, \(P(h_{\theta}) d\theta\). The likelihoods need to be defined by a limit process: the probability \(P(h_{\theta})\) is infinitely small, so that we cannot define \(P(s \mid h_{\theta})\) in the normal manner. But other than that the Bayesian machinery works exactly the same:\[ P(h_{\theta} \mid s) d\theta \;\; = \;\; \frac{P(s \mid h_{\theta})}{P(s)} P(h_{\theta}) d\theta. \]Finally, summations need to be replaced by integrations:\[ P(s) \; = \; \int_{\theta \in \Theta} P(h_{\theta}) P(s \mid h_{\theta}) d\theta . \]This expression is often called the marginal likelihood of the model: it expresses how probable the data is in the light of the model as a whole.
The posterior probability density provides a basis for conclusions that one might draw from the sample \(s\), and which are similar to estimations and measures for the accuracy of the estimations. For one, we can derive an expectation for the parameter \(\theta\), where we assume that \(\theta\) varies continuously:\[ \bar{\theta} \;\; = \;\; \int_{\Theta}\, \theta P(h_{\theta} \mid s) d\theta. \]If the model is parameterized by a convex set, which it typically is, then there will be a hypothesis \(h_{\bar{\theta}}\) in the model. This hypothesis can serve as a Bayesian estimation. In analogy to the confidence interval, we can also define a so-called credal interval or credibility interval from the posterior probability distribution: an interval of size \(2d\) around the expectation value \(\bar{\theta}\), written \([\bar{\theta} - d, \bar{\theta} + d]\), such that\[ \int_{\bar{\theta} - d}^{\bar{\theta} + d} P(h_{\theta} \mid s) d\theta = 1-\epsilon . \]This range of values for \(\theta\) is such that the posterior probability of the corresponding \(h_{\theta}\) adds up to \(1-\epsilon\) of the total posterior probability.
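The posterior expectation and a credal interval can be approximated on a grid. The sketch below uses an illustrative model that is not from the text: a uniform prior over a bias \(\theta \in [0,1]\) and a binomial likelihood for 7 successes in 10 trials, whose posterior is the Beta(8, 4) density.

```python
# Grid approximation of a posterior density over theta in [0, 1],
# for a uniform prior and a binomial likelihood (illustrative choice).
N = 2_000
thetas = [(i + 0.5) / N for i in range(N)]

def likelihood(theta, successes=7, trials=10):
    """Binomial likelihood for a fixed, illustrative sample (constants dropped)."""
    return theta**successes * (1 - theta)**(trials - successes)

weights = [likelihood(t) for t in thetas]   # uniform prior: weights are the likelihoods
total = sum(weights)
posterior = [w / total for w in weights]    # normalized posterior mass on the grid

# Posterior expectation of theta, a Bayesian point estimate.
theta_bar = sum(t * p for t, p in zip(thetas, posterior))

# Half-width d of the symmetric interval around theta_bar holding 95% of the mass.
def credal_halfwidth(level=0.95):
    d = 0.0
    while d < 1.0:
        mass = sum(p for t, p in zip(thetas, posterior) if abs(t - theta_bar) <= d)
        if mass >= level:
            return d
        d += 0.001
    return d

d = credal_halfwidth()
print(round(theta_bar, 3))  # ≈ 0.667, the Beta(8, 4) posterior mean
print(round(d, 3))
```

Here the estimate \(\bar{\theta}\) and the interval \([\bar{\theta} - d, \bar{\theta} + d]\) are both read off the same posterior density, as in the text.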
There are many other ways of defining Bayesian estimations and credal intervals for \(\theta\) on the basis of the posterior density. The specific type of estimation that the Bayesian analysis offers can be determined by the demands of the scientist. Any Bayesian estimation will to some extent resemble the maximum likelihood estimator, due to the central role of the likelihoods in the Bayesian formalism. However, the output will also depend on the prior probability over the hypotheses, and generally speaking it will only tend to the maximum likelihood estimator when the sample size tends to infinity. See Section 4.2.2 for more on this so-called “washing out” of the priors.
Most of the controversy over the Bayesian method concerns the probability assignment over hypotheses. One important set of problems surrounds the interpretation of those probabilities as beliefs, as having to do with a willingness to act, or the like. Another set of problems pertains to the determination of the prior probability assignment, and the criteria that might govern it.
The overall question here is how we should understand the probability assigned to a statistical hypothesis. Naturally the interpretation will be epistemic: the probability expresses the strength of belief in the hypothesis. It makes little sense to attempt a physical interpretation, since the hypothesis cannot be seen as a repeatable event, or as an event that might have some tendency of occurring.
This leaves open several interpretations of the probability assignment as a strength of belief. One very influential interpretation of probability as degree of belief relates probability to a willingness to bet against certain odds (cf. Ramsey 1926, De Finetti 1937/1964, Earman 1992, Jeffrey 1992, Howson 2000). According to this interpretation, assigning a probability of \(3/4\) to a proposition, for example, means that we are prepared to pay at most $0.75 for a betting contract that pays out $1 if the proposition is true, and that turns worthless if the proposition is false. The claim that degrees of belief are correctly expressed in a probability assignment is then supported by a so-called Dutch book argument: if an agent does not comply with the axioms of probability theory, a malign bookmaker can propose a set of bets that seems fair to the agent but that leads to a certain monetary loss, and that is therefore called Dutch, presumably owing to the Dutch's mercantile reputation. This interpretation associates beliefs directly with their behavioral consequences: believing something is the same as having the willingness to engage in a particular activity, e.g., in a bet.
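A Dutch book can be exhibited with a small calculation. In the sketch below, with illustrative numbers not taken from the text, an agent's degrees of belief in \(A\) and in not-\(A\) sum to more than 1; buying both $1 bets at prices equal to her credences then guarantees a loss however \(A\) turns out.

```python
def net_result(credences, a_is_true):
    """Agent buys, for each proposition, a $1 bet priced at her credence in it.
    Exactly one of the two bets pays out, whatever happens to A, so the
    outcome argument does not affect the total payout."""
    total_price = credences["A"] + credences["not-A"]
    payout = 1.0  # either the bet on A or the bet on not-A wins
    return payout - total_price

credences = {"A": 0.4, "not-A": 0.7}  # incoherent: credences sum to 1.1 > 1
for outcome in (True, False):
    print(f"A={outcome}: net = {net_result(credences, outcome):+.2f}")
# net = -0.10 in both cases: a sure loss
```

Had the credences summed to exactly 1, as the axioms require, the net result would have been zero either way, and no book could be made.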
There are several problems with this interpretation of the probability assignment over hypotheses. For one, it seems to make little sense to bet on the truth of a statistical hypothesis, because such hypotheses cannot be falsified or verified. Consequently, a betting contract on them will never be cashed. More generally, it is not clear that beliefs about statistical hypotheses are properly framed by connecting them to behavior in this way. It has been argued (e.g., Armendt 1993) that this way of framing probability assignments introduces pragmatic considerations on beliefs, to do with navigating the world successfully, into a setting that is by itself more concerned with belief as a truthful representation of the world.
A somewhat different problem is that the Bayesian formalism, in particular its use of probability assignments over statistical hypotheses, suggests a remarkable closed-mindedness on the part of the Bayesian statistician. Recall the example of the foregoing, with the model \(M = \{ h_{1/2}, h_{3/4} \}\). The Bayesian formalism requires that we assign a probability distribution over these two hypotheses, and further that the probability of the model is \(P(M) = 1\). It is quite a strong assumption, even of an ideally rational agent, that she is indeed equipped with a real-valued function that expresses her opinion over the hypotheses. Moreover, the probability assignment over hypotheses seems to entail that the Bayesian statistician is certain that the true hypothesis is included in the model. This is an unduly strong claim to which a Bayesian statistician will have to commit at the start of her analysis. It sits badly with broadly shared methodological insights (e.g., Popper 1934/1956), according to which scientific theory must be open to revision at all times (cf. Mayo 1996). In this regard Bayesian statistics does not do justice to the nature of scientific inquiry, or so it seems.
The problem just outlined obtains a mathematically more sophisticated form in the problem that Bayesians expect themselves to be well-calibrated. This problem, as formulated in Dawid (1982), concerns a Bayesian forecaster, e.g., a weatherman who determines a daily probability for precipitation on the next day. It is then shown that such a weatherman believes of himself that in the long run he will converge onto the correct probability with probability 1. Yet it seems reasonable to suppose that the weatherman realizes something could potentially be wrong with his meteorological model, and so sets his probability for correct prediction below 1. The weatherman is thus led to incoherent beliefs. It seems that Bayesian statistical analysis places unrealistic demands, even on an ideal agent.
For the moment, assume that we can interpret the probability over hypotheses as an expression of epistemic uncertainty. Then how do we determine a prior probability? Perhaps we already have an intuitive judgment on the hypotheses in the model, so that we can pin down the prior probability on that basis. Or else we might have additional criteria for choosing our prior. However, several serious problems attach to procedures for determining the prior.
First consider the idea that the scientist who runs the Bayesian analysis provides the prior probability herself. One obvious problem with this idea is that the opinion of the scientist might not be precise enough for the determination of a full prior distribution. It does not seem realistic to suppose that the scientist can transform her opinion into a single real-valued function over the model, especially not if the model itself consists of a continuum of hypotheses. But the more pressing problem is that different scientists will provide different prior distributions, and that these different priors will lead to different statistical results. In other words, Bayesian statistical inference introduces an inevitable subjective component into scientific method.
It is one thing that the statistical results depend on the initial opinion of the scientist. But it may so happen that the scientist has no opinion whatsoever about the hypotheses. How is she supposed to assign a prior probability to the hypotheses then? The prior will have to express her ignorance concerning the hypotheses. The leading idea in expressing such ignorance is usually the principle of indifference: ignorance means that we are indifferent between any pair of hypotheses. For a finite number of hypotheses, indifference means that every hypothesis gets equal probability. For a continuum of hypotheses, indifference means that the probability density function must be uniform.
Nevertheless, there are different ways of applying the principle of indifference, and so there are different probability distributions over the hypotheses that can count as expressions of ignorance. This insight is nicely illustrated in Bertrand's paradox.
Bertrand's paradox
Consider a circle drawn around an equilateral triangle, and now imagine that a knitting needle whose length exceeds the circle's diameter is thrown onto the circle. What is the probability that the section of the needle lying within the circle is longer than the side of the equilateral triangle? To determine the answer, we need to parameterize the ways in which the needle may be thrown, determine the subset of parameter values for which the included section is indeed longer than the triangle's side, and express our ignorance over the exact throw of the needle in a probability distribution over the parameter, so that the probability of the said event can be derived. The problem is that we may provide any number of ways to parameterize how the needle lands in the circle. If we use the angle that the needle makes with the tangent of the circle at the intersection, then the included section of the needle is only going to be longer if the angle is between \(60^{\circ}\) and \(120^{\circ}\). If we assume that our ignorance is expressed by a uniform distribution over these angles, which range from \(0^{\circ}\) to \(180^{\circ}\), then the probability of the event is going to be \(1/3\). However, we can also parameterize the ways in which the needle lands differently, namely by the shortest distance of the needle to the centre of the circle. A uniform probability over the distances will lead to a probability of \(1/2\).
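Both parameterizations are easy to simulate. The Monte Carlo sketch below takes a unit circle, so the inscribed triangle's side is \(\sqrt{3}\); the first run draws the tangent angle uniformly and the second draws the distance to the centre uniformly, recovering the two conflicting answers.

```python
import math
import random

random.seed(0)
N = 200_000
side = math.sqrt(3)  # side of an equilateral triangle inscribed in the unit circle

# Parameterization 1: angle with the tangent, uniform on (0, 180 degrees).
# A chord at angle alpha to the tangent of the unit circle has length 2*sin(alpha).
hits_angle = sum(
    2 * math.sin(random.uniform(0, math.pi)) > side for _ in range(N)
)
print(round(hits_angle / N, 2))  # ≈ 0.33

# Parameterization 2: shortest distance to the centre, uniform on (0, 1).
# A chord at distance dist from the centre has length 2*sqrt(1 - dist^2).
hits_dist = sum(
    2 * math.sqrt(1 - random.uniform(0, 1) ** 2) > side for _ in range(N)
)
print(round(hits_dist / N, 2))   # ≈ 0.5
```

Both runs express "ignorance" by a uniform distribution, yet they disagree, because they are uniform over different parameters.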
Jaynes (1973 and 2003) provides a very insightful discussion of this riddle and also argues that it may be resolved by relying on invariances of the problem under certain transformations. But the general message for now is that the principle of indifference does not lead to a unique choice of priors. The point is not that ignorance concerning a parameter is hard to express in a probability distribution over its values. It is rather that in some cases, we do not even know what parameters to use to express our ignorance over.
In part the problem of the subjectivity of Bayesian analysis may be resolved by taking a different attitude to scientific theory, and by giving up the ideal of absolute objectivity. Indeed, some will argue that it is just right that the statistical methods accommodate differences of opinion among scientists. However, this response misses the mark if the prior distribution expresses ignorance rather than opinion: it seems harder to defend the rationality of differences of opinion that stem from different ways of spelling out ignorance. Now there is also a more positive answer to worries over objectivity, based on so-called convergence results (e.g., Blackwell and Dubins 1962, Gaifman and Snir 1982). It turns out that the impact of the prior choice diminishes with the accumulation of data, and that in the limit the posterior distribution will converge to a set, possibly a singleton, of best hypotheses, determined by the sampled data and hence completely independent of the prior distribution. However, in the short and medium run the influence of subjective prior choice remains.
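The convergence can be illustrated with a small simulation (an illustrative setup, not from the text): two agents hold different Beta priors over a coin's bias and update on the same sequence of tosses; the gap between their posterior means shrinks as the data accumulate.

```python
import random

random.seed(1)
true_theta = 0.7
# Beta(a, b) prior over the bias; after observing heads and tails the posterior
# is Beta(a + heads, b + tails), with mean (a + heads) / (a + b + heads + tails).
priors = {"optimist": (8.0, 2.0), "pessimist": (2.0, 8.0)}

heads = tails = 0
for n in range(1, 5001):
    if random.random() < true_theta:
        heads += 1
    else:
        tails += 1
    if n in (10, 100, 5000):
        means = {name: (a + heads) / (a + b + n) for name, (a, b) in priors.items()}
        gap = abs(means["optimist"] - means["pessimist"])
        # The gap between the two posterior means is exactly 6 / (10 + n),
        # whatever the tosses: 0.3, then ~0.055, then ~0.0012.
        print(n, round(gap, 4))
```

In the limit both posteriors concentrate on the same value, in line with the convergence results; but at \(n = 10\) the two agents still disagree substantially, which is the "short and medium run" caveat of the text.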
Summing up, it remains problematic that Bayesian statistics is sensitive to subjective input. The undeniable advantage of the classical statistical procedures is that they do not need any such input, although arguably the classical procedures are in turn sensitive to choices concerning the sample space (Lindley 2000). Against this, Bayesian statisticians point to the advantage of being able to incorporate initial opinions into the statistical analysis.
The philosophy of Bayesian statistics offers a wide range of responses to the problems outlined above. Some Bayesians bite the bullet and defend the essentially subjective character of Bayesian methods. Others attempt to remedy or compensate for the subjectivity, by providing objectively motivated means of determining the prior probability, or by emphasizing the objective character of the Bayesian formalism itself.
One very influential view on Bayesian statistics buys into the subjectivity of the analysis (e.g., Goldstein 2006, Kadane 2011). So-called personalists or strict subjectivists argue that it is just right that the statistical methods do not provide any objective guidelines, pointing to radically subjective sources of any form of knowledge. The problems concerning the interpretation and choice of the prior distribution are thus dissolved, at least in part: the Bayesian statistician may choose her prior at will, as it is an expression of her beliefs. However, it deserves emphasis that a subjectivist view on Bayesian statistics does not mean that all constraints deriving from empirical fact can be disregarded. Nobody denies that if you have further knowledge that imposes constraints on the model or the prior, then those constraints must be accommodated. For example, today's posterior probability may be used as tomorrow's prior, in the next statistical inference. The point is that such constraints concern the rationality of belief and not the consistency of the statistical inference per se.
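The idea that today's posterior serves as tomorrow's prior can be checked directly: for independent samples, updating sequentially yields the same posterior as conditioning on both samples at once. A minimal sketch with illustrative numbers:

```python
# Sequential updating (posterior of day 1 as prior of day 2) agrees with
# a single batch update on the combined, independent samples.
def update(prior, likelihoods):
    p_s = sum(prior[h] * likelihoods[h] for h in prior)
    return {h: prior[h] * likelihoods[h] / p_s for h in prior}

prior = {"h1": 0.5, "h2": 0.5}
day1 = {"h1": 0.2, "h2": 0.6}  # likelihoods of the first sample
day2 = {"h1": 0.9, "h2": 0.4}  # likelihoods of the second sample

sequential = update(update(prior, day1), day2)
batch = update(prior, {h: day1[h] * day2[h] for h in prior})  # independence

print(sequential)
print(batch)  # the two agree, up to rounding
```

This commutativity is what makes the "yesterday's posterior, today's prior" practice coherent rather than an extra assumption.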
Subjectivist views are most prominent among those who interpret probability assignments in a pragmatic fashion, and motivate the representation of belief with probability assignments by the aforementioned Dutch book arguments. Central to this approach is the work of Savage and De Finetti. Savage (1962) proposed to axiomatize statistics in tandem with decision theory, a mathematical theory about practical rationality. He argued that by themselves the probability assignments do not mean anything at all, and that they can only be interpreted in the context where an agent faces a choice between actions, i.e., a choice among a set of bets. In similar vein, De Finetti (e.g., 1974) advocated a view on statistics in which only the empirical consequences of the probabilistic beliefs, expressed in a willingness to bet, mattered, but he did not make statistical inference fully dependent on decision theory. Remarkably, it thus appears that the subjectivist view on Bayesian statistics is based on the same behaviorism and empiricism that motivated Neyman and Pearson to develop classical statistics.
Notice that all this makes one aspect of the interpretation problem of Section 4.2.1 reappear: how will the prior distribution over hypotheses make itself apparent in behavior, so that it can rightfully be interpreted in terms of belief, here understood as a willingness to act? One response to this question is to turn to different motivations for representing degrees of belief by means of probability assignments. Following work by De Finetti, several authors have proposed vindications of probabilistic expressions of belief that are not based on behavioral goals, but rather on the epistemic goal of holding beliefs that accurately represent the world, e.g., Rosenkrantz (1981), Joyce (2001), Leitgeb and Pettigrew (2010), Easwaran (2013). A strong generalization of this idea is achieved in Schervish, Seidenfeld and Kadane (2009), which builds on a longer tradition of using scoring rules for achieving statistical aims. An alternative approach is that any formal representation of belief must respect certain logical constraints; e.g., Cox provides an argument for the expression of belief in terms of probability assignments on the basis of the nature of partial belief per se.
However, the original subjectivist response to the issue that a prior over hypotheses is hard to interpret came from De Finetti's so-called representation theorem, which shows that every prior distribution can be associated with its own set of predictions, and hence with its own behavioral consequences. In other words, De Finetti showed how priors are indeed associated with beliefs that can carry a betting interpretation.
De Finetti's representation theorem relates rules for prediction, as functions of the given sample data, to Bayesian statistical analyses of those data, against the background of a statistical model. See Festa (1996) and Suppes (2001) for useful introductions. De Finetti considers a process that generates a series of time-indexed observations, and he then studies prediction rules that take finite segments of this series as input and return a probability over future events, using a statistical model that can analyze such samples and provide the predictions. The key result of De Finetti is that a particular statistical model, namely the set of all distributions in which the observations are independently and identically distributed, can be equated with the class of exchangeable prediction rules, namely the rules whose predictions do not depend on the order in which the observations come in.
Let us consider the representation theorem in some more formal detail. For simplicity, say that the process generates time-indexed binary observations, i.e., 0's and 1's. The prediction rules take such bit strings of length \(t\), denoted \(S_{t}\), as input, and return a probability for the event that the next bit in the string is a 1, denoted \(Q^{1}_{t+1}\). So we write the prediction rules as partial probability assignments \(P(Q^{1}_{t+1} \mid S_{t})\). Exchangeable prediction rules are rules that deliver the same prediction independently of the order of the bits in the string \(S_{t}\). If we write the event that the string \(S_{t}\) has a total of \(n\) observations of 1's as \(S_{n/t}\), then exchangeable prediction rules are written as \(P(Q^{1}_{t+1} \mid S_{n/t})\). The crucial property is that the value of the prediction is not affected by the order in which the 0's and 1's show up in the string \(S_{t}\).
De Finetti relates this particular set of exchangeable prediction rules to a Bayesian inference over a specific type of statistical model. The model that De Finetti considers comprises the so-called Bernoulli hypotheses \(h_{\theta}\), i.e., hypotheses for which\[ P(Q^{1}_{t+1} \mid h_{\theta} \cap S_{t}) = \theta . \]This likelihood does not depend on the string \(S_{t}\) that has gone before. The hypotheses are best thought of as determining a fixed bias \(\theta\) for the binary process, where \(\theta \in \Theta = [0,1]\). The representation theorem states that there is a one-to-one mapping between priors over Bernoulli hypotheses and exchangeable prediction rules. That is, every prior distribution \(P(h_{\theta})\) can be associated with exactly one exchangeable prediction rule \(P(Q^{1}_{t+1} \mid S_{n/t})\), and conversely. Next to the original representation theorem derived by De Finetti, several other and more general representation theorems were proved, e.g., for partially exchangeable sequences and hypotheses on Markov processes (Diaconis and Freedman 1980, Skyrms 1991), for clustering predictions and partitioning processes (Kingman 1975 and 1978), and even for sequences of graphs and their generating process (Aldous 1981).
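A concrete instance of this correspondence, under the assumption of a uniform prior over the bias \(\theta\): the induced prediction rule is \(P(Q^{1}_{t+1} \mid S_{n/t}) = (n+1)/(t+2)\), Laplace's rule of succession, which depends only on the count \(n\) and not on the order of the bits, i.e., it is exchangeable. The sketch below computes the rule from the ratio of marginal likelihoods.

```python
import math

def predictive(n, t):
    """P(next bit = 1 | n ones among t bits), for a uniform prior over theta.
    Equals the ratio of the marginal likelihoods of the extended and the
    observed string: int theta^(n+1) (1-theta)^(t-n) dtheta over
    int theta^n (1-theta)^(t-n) dtheta."""
    def beta_integral(a, b):
        # int_0^1 theta^a (1 - theta)^b dtheta = a! b! / (a + b + 1)!
        return math.factorial(a) * math.factorial(b) / math.factorial(a + b + 1)
    return beta_integral(n + 1, t - n) / beta_integral(n, t - n)

# The prediction depends only on the count n, not on the order of the bits:
print(predictive(3, 5))   # equals (3 + 1) / (5 + 2) = 4/7
print((3 + 1) / (5 + 2))  # Laplace's rule of succession gives the same value
```

Varying the prior over \(\theta\) varies the prediction rule, but by the representation theorem the resulting rule is always exchangeable, and every exchangeable rule arises from some prior.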
Representation theorems equate a prior distribution over statistical hypotheses to a prediction rule, and thus to a probability assignment that can be given a subjective and behavioral interpretation. This removes the worry expressed above, that the prior distribution over hypotheses cannot be interpreted subjectively because it cannot be related to belief as a willingness to act: priors relate uniquely to particular predictions. However, for De Finetti the representation theorem provided a reason for doing away with statistical hypotheses altogether, and hence for the removal of a notion of probability as anything other than subjective opinion (cf. Hintikka 1970): hypotheses whose probabilistic claims could be taken to refer to intangible chancy processes are superfluous metaphysical baggage.
Not all subjectivists are equally dismissive of the use of statistical hypotheses. Jeffrey (1992) has proposed so-called mixed Bayesianism, in which subjectively interpreted distributions over the hypotheses are combined with a physical interpretation of the distributions that hypotheses define over sample space. Romeijn (2003, 2005, 2006) argues that priors over hypotheses are an efficient and more intuitive way of determining inductive predictions than specifying properties of predictive systems directly. This advantage of using hypotheses seems in agreement with the practice of science, in which hypotheses are routinely used, and often motivated by mechanistic knowledge of the data generating process. The fact that statistical hypotheses can strictly speaking be eliminated does not take away from their utility in making predictions.
Despite its—seemingly inevitable—subjective character, there is a sense in which Bayesian statistics might lay claim to objectivity. It can be shown that the Bayesian formalism meets certain objective criteria of rationality, coherence, and calibration. Bayesian statistics thus answers to the requirement of objectivity at a meta-level: while the opinions that it deals with retain a subjective aspect, the way in which it deals with these opinions, in particular the way in which data impact on them, is objectively correct, or so it is argued. Arguments supporting the Bayesian way of accommodating data, namely by conditionalization, have been provided in a pragmatic context by dynamic Dutch book arguments, whereby probability is interpreted as a willingness to bet (cf. Maher 1993, van Fraassen 1989). Similar arguments have been advanced on the grounds that our beliefs must accurately represent the world along the lines of De Finetti (1974), e.g., Greaves and Wallace (2006) and Leitgeb and Pettigrew (2010).
An important distinction must be made in arguments that support the Bayesian way of accommodating evidence: the distinction between Bayes' theorem, as a mathematical given, and Bayes' rule, as a principle of coherence over time. The theorem is simply a mathematical relation among probability assignments, \[ P(h \mid s) \; = \; P(h) \frac{P(s \mid h)}{P(s)} , \] and as such not subject to debate. Arguments that support the representation of the epistemic state of an agent by means of probability assignments also provide support for Bayes' theorem as a constraint on degrees of belief. The conditional probability \(P(h \mid s)\) can be interpreted as the degree of belief attached to the hypothesis \(h\) on the condition that the sample \(s\) is obtained, as an integral part of the epistemic state captured by the probability assignment. Bayes' rule, by contrast, presents a constraint on probability assignments that represent epistemic states of an agent at different points in time. It is written as \[ P_{s}(h) \; = \; P(h \mid s) , \] and it determines that the new probability assignment, expressing the epistemic state of the agent after the sample has been obtained, is systematically related to the old assignment, representing the epistemic state before the sample came in. In the philosophy of statistics many Bayesians adopt Bayes' rule implicitly, but in what follows I will only assume that Bayesian statistical inferences rely on Bayes' theorem.
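Bayes' theorem can be illustrated with a minimal numeric example; the two hypotheses, their priors, and the likelihoods below are invented purely for the illustration:

```python
# Bayes' theorem for two hypotheses h and h', with invented numbers:
# P(h | s) = P(h) * P(s | h) / P(s), where P(s) comes from the law of
# total probability over the two hypotheses.
prior = {"h": 0.5, "h'": 0.5}
likelihood = {"h": 0.8, "h'": 0.2}     # P(s | .) for the observed sample s

p_s = sum(prior[x] * likelihood[x] for x in prior)   # P(s) = 0.5
posterior = {x: prior[x] * likelihood[x] / p_s for x in prior}
print(posterior)  # h gets 0.8, h' gets 0.2
```

Bayes' rule, by contrast, is not exhibited in the arithmetic itself: it is the further claim that `posterior` should become the agent's new probability assignment after the sample is obtained.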
Whether the focus lies on Bayes' rule or on Bayes' theorem, the common theme in the above-mentioned arguments is that they approach Bayesian statistical inference from a logical angle, and focus on its internal coherence or consistency (cf. Howson 2003). While its use in statistics is undeniably inductive, Bayesian inference thereby obtains a deductive, or at least non-ampliative character: everything that is concluded in the inference is somehow already present in the premises. In Bayesian statistical inference, those premises are given by the prior over the hypotheses, \(P(h_{\theta})\) for \(\theta \in \Theta\), and the likelihood functions, \(P(s \mid h_{\theta})\), as determined for each hypothesis \(h_{\theta}\) separately. These premises fix a single probability assignment over the space \(M \times S\) at the outset of the inference. The conclusions, in turn, are straightforward consequences of this probability assignment. They can be derived by applying theorems of probability theory, most notably Bayes' theorem. Bayesian statistical inference thus becomes an instance of probabilistic logic (cf. Hailperin 1986, Halpern 2003, Haenni et al 2011).
Summing up, there are several arguments showing that statistical inference by Bayes' theorem, or by Bayes' rule, is objectively correct. These arguments invite us to consider Bayesian statistics as an instance of probabilistic logic. Such appeals to the logicality of Bayesian statistical inference may provide a partial remedy for its subjective character. Moreover, a logical approach to the statistical inferences avoids the problem that the formalism places unrealistic demands on the agents, and that it presumes the agent to have certain knowledge. Much like in deductive logic, we need not assume that the inferences are psychologically realistic, nor that the agents actually believe the premises of the arguments. Rather the arguments present the agents with a normative ideal and take the conditional form of consistency constraints: if you accept the premises, then these are the conclusions.
An important instance of probabilistic logic is presented in inductive logic, as devised by Carnap, Hintikka and others (Carnap 1950 and 1952, Hintikka and Suppes 1966, Carnap and Jeffrey 1970, Hintikka and Niiniluoto 1980, Kuipers 1978, Paris 1994, Nix and Paris 2006, Paris and Waterhouse 2009). Historically, Carnapian inductive logic developed prior to the probabilistic logics referenced above, and more or less separately from the debates in the philosophy of statistics. But the logical systems of Carnap can quite easily be placed in the context of a logical approach to Bayesian inference, and doing this is in fact quite insightful.
For simplicity, we choose a setting that is similar to the one used in the exposition of the representation theorem, namely a binary data generating process, i.e., strings of 0's and 1's. A prediction rule determines a probability for the event, denoted \(Q^{1}_{t+1}\), that the next bit in the string is a 1, on the basis of a given string of bits with length \(t\), denoted by \(S_{t}\). Carnap and followers designed specific exchangeable prediction rules, mostly variants of the straight rule (Reichenbach 1938), \[ P(Q^{1}_{t+1} \mid S_{n/t}) = \frac{n + 1}{t + 2} , \] where \(S_{n/t}\) denotes a string of length \(t\) of which \(n\) entries are 1's. Carnap derived such rules from constraints on the probability assignments over the samples. Some of these constraints boil down to the axioms of probability. Other constraints, exchangeability among them, are independently motivated, by an appeal to the so-called logical interpretation of probability. Under this logical interpretation, the probability assignment must respect certain invariances under transformations of the sample space, in analogy to logical principles that constrain truth valuations over a language in a particular way.
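The rule quoted above, \((n+1)/(t+2)\), and its exchangeability can be checked directly. The sketch below uses exact fractions and verifies that chaining the sequential predictions assigns the same probability to any reordering of the same bits:

```python
from fractions import Fraction

def rule(bits):
    """The prediction rule P(Q^1_{t+1} | S_{n/t}) = (n+1)/(t+2)."""
    n, t = sum(bits), len(bits)
    return Fraction(n + 1, t + 2)

def string_probability(bits):
    """Chain rule: multiply the sequential predictions for each bit in turn."""
    p = Fraction(1)
    for i, b in enumerate(bits):
        q = rule(bits[:i])          # prediction given the bits seen so far
        p *= q if b == 1 else 1 - q
    return p

# Exchangeability: reorderings of the same bits get the same probability.
print(string_probability([1, 1, 0, 0]), string_probability([0, 1, 0, 1]))
# both equal 1/30
```

That both strings receive probability 1/30 reflects the fact that the rule is sensitive only to the counts \(n\) and \(t\), which is exactly the exchangeability constraint Carnap imposed.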
Carnapian inductive logic is an instance of probabilistic logic, because its sequential predictions are all based on a single probability assignment at the outset, and because it relies on Bayes' theorem to adapt the predictions to sample data (cf. Romeijn 2011). One important difference with Bayesian statistical inference is that, for Carnap, the probability assignment specified at the outset only ranges over samples and not over hypotheses. However, by De Finetti's representation theorem Carnap's exchangeable rules can be equated to particular Bayesian statistical inferences. A further difference is that Carnapian inductive logic gives preferred status to particular exchangeable rules. In view of De Finetti's representation theorem, this comes down to the choice for a particular set of preferred priors. As further developed below, Carnapian inductive logic is thus related to objective Bayesian statistics. It is a moot point whether further constraints on the probability assignments can be considered as logical, as Carnap and followers have it, or whether the title of logic is best reserved for the probability formalism in isolation, as De Finetti and followers argue.
A further set of responses to the subjectivity of Bayesian statistical inference targets the prior distribution directly: we might provide further rationality principles with which the prior can be chosen objectively. The literature proposes several objective criteria for filling in the prior over the model. Each of these lays claim to being the correct expression of complete ignorance concerning the value of the model parameters, or of minimal information regarding the parameters. Three such criteria are discussed here.
In the context of Bertrand's paradox we already discussed the principle of indifference, according to which probability should be distributed evenly over the available possibilities. A further development of this idea is presented by the requirement that a distribution should have maximum entropy. Notably, the use of entropy maximization for determining degrees of belief finds much broader application than only in statistics: similar ideas are taken up in diverse fields like epistemology (e.g., Shore and Johnson 1980, Williams 1980, Uffink 1996, and also Williamson 2010), inductive logic (Paris and Vencovska 1989), statistical mechanics (Jaynes 2003) and decision theory (Seidenfeld 1986, Grunwald and Halpern 2004). In objective Bayesian statistics, the idea is applied to the prior distribution over the model (cf. Berger 2006). For a finite number of hypotheses the entropy of the distribution \(P(h_{\theta})\) is defined as \[ E[P] \; = \; - \sum_{\theta \in \Theta} P(h_{\theta}) \log P(h_{\theta}) . \] The requirement of maximum entropy unequivocally leads to equiprobable hypotheses. However, for continuous models the maximum entropy distribution depends crucially on the metric over the parameters in the model. The burden of subjectivity is thereby moved to the parameterization, but of course it may well be that we have strong reasons for preferring a particular parameterization over others (cf. Jaynes 1973).
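For a finite set of hypotheses the claim that maximum entropy yields equiprobability is easy to illustrate; the two distributions below are invented for the example:

```python
import math

# Entropy E[P] = -sum P log P over a finite set of hypotheses, with the
# convention 0 log 0 = 0. The uniform distribution over k hypotheses
# attains the maximum value log k.
def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

uniform = [0.25] * 4                 # equiprobable over four hypotheses
skewed = [0.7, 0.1, 0.1, 0.1]        # any non-uniform choice scores lower
print(entropy(uniform), entropy(skewed))  # log 4 ≈ 1.386 vs a smaller value
```

The uniform distribution reaches \(\log 4\), and every departure from it lowers the entropy, which is the sense in which the maximum entropy requirement recovers the principle of indifference in the finite case.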
There are other approaches to the objective determination of priors. In view of the above problems, a particularly attractive method for choosing a prior over a continuous model is proposed by Jeffreys (1961). The general idea of so-called Jeffreys priors is that the prior probability assigned to a small patch in the parameter space is proportional to what may be called the density of the distributions within that patch. Intuitively, if a lot of distributions, i.e., distributions that differ quite a lot among themselves, are packed together on a small patch in the parameter space, this patch should be given a larger prior probability than a similar patch within which there is little variation among the distributions (cf. Balasubramanian 2005). More technically, such a density is expressed by a prior distribution that is proportional to the square root of the determinant of the Fisher information. A key advantage of these priors is that they are invariant under reparameterizations of the parameter space: a new parameterization naturally leads to an adjusted density of distributions.
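The invariance property can be checked in a one-parameter case. For the Bernoulli model the Fisher information is \(I(\theta) = 1/(\theta(1-\theta))\); the sketch below (the reparameterization \(\phi = \theta^2\) is an arbitrary choice for the illustration) computes the Jeffreys prior directly in the new parameter and compares it with the change-of-variables transform of the old prior:

```python
import math

def fisher_theta(t):
    """Fisher information of the Bernoulli model in the theta parameterization."""
    return 1.0 / (t * (1.0 - t))

def jeffreys_theta(t):
    """Unnormalized Jeffreys prior: sqrt of the Fisher information."""
    return math.sqrt(fisher_theta(t))

# Reparameterize: phi = theta**2, so theta = sqrt(phi), dtheta/dphi = 1/(2 sqrt(phi)).
def jeffreys_phi_direct(phi):
    t = math.sqrt(phi)
    dt_dphi = 1.0 / (2.0 * math.sqrt(phi))
    return math.sqrt(fisher_theta(t) * dt_dphi**2)   # Fisher info computed in phi

def jeffreys_phi_transformed(phi):
    t = math.sqrt(phi)
    dt_dphi = 1.0 / (2.0 * math.sqrt(phi))
    return jeffreys_theta(t) * dt_dphi               # change of variables on the prior

print(jeffreys_phi_direct(0.3), jeffreys_phi_transformed(0.3))  # equal
```

The two routes agree: recomputing the Jeffreys prior in the new parameterization gives exactly what the old prior transforms into, which is the invariance claimed in the text.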
A final method of defining priors goes under the name of reference priors (Berger et al 2009). The proposal starts from the observation that we should minimize the subjectivity of the results of our statistical analysis, and hence that we should minimize the impact of the prior probability on the posterior. The idea of reference priors is exactly that they allow the sample data a maximal say in the posterior distribution. But since at the outset we do not know what sample we will obtain, the prior is chosen so as to maximize the expected impact of the data. The expectation must itself be taken with respect to some distribution over sample space, but again, we may well have strong reasons for choosing this latter distribution.
A different response to the subjectivity of priors is to extend the Bayesian formalism, in order to leave the choice of prior to some extent open. The subjective choice of a prior is in that case circumvented. Two such responses will be considered in some detail.
Recall that a prior probability distribution over statistical hypotheses expresses our uncertain opinion on which of the hypotheses is right. The central idea behind hierarchical Bayesian models (Gelman et al 2013) is that the same pattern of putting a prior over statistical hypotheses can be repeated on the level of priors itself. More precisely, we may be uncertain over which prior probability distribution over the hypotheses is right. If we characterize possible priors by means of a set of parameters, we can express this uncertainty about prior choice in a probability distribution over the parameters that characterize the shape of the prior. In other words, we move our uncertainty one level up in a hierarchy: we consider multiple priors over the statistical hypotheses, and compare the performance of these priors on the sample data as if the priors were themselves hypotheses.
The idea of hierarchical Bayesian modeling (Gelman et al 2013) relates naturally to the Bayesian comparison of Carnapian prediction rules (e.g., Skyrms 1993 and 1996, Festa 1996), and also to the estimation of optimum inductive methods (Kuipers 1986, Festa 1993). Hierarchical Bayesian modeling can also be related to another tool for choosing a particular prior distribution over hypotheses, namely the method of empirical Bayes, which estimates the prior that leads to the maximal marginal likelihood of the model. In the philosophy of science, hierarchical Bayesian modeling made its first appearance in Henderson et al (2010).
There is also a response that avoids the choice of a prior altogether. This response starts with the same idea as hierarchical models: rather than considering a single prior over the hypotheses in the model, we consider a parameterized set of them. But instead of defining a distribution over this set, proponents of interval-valued or imprecise probability claim that our epistemic state regarding the priors is better expressed by this set of distributions, and that sharp probability assignments must therefore be replaced by lower and upper bounds to the assignments. Now the idea that uncertain opinion is best captured by a set of probability assignments, or a credal set for short, has a long history and is backed by an extensive literature (e.g., De Finetti 1974, Levi 1980, Dempster 1967 and 1968, Shafer 1976, Walley 1991). In light of the main debate in the philosophy of statistics, the use of interval-valued priors indeed forms an attractive extension of Bayesian statistics: it allows us to refrain from choosing a specific prior, and thereby presents a rapprochement to the classical view on statistics.
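A small sketch can show how a credal set of priors maps to interval-valued predictions. The choice of a Beta family with \(a + b = 2\) and \(a \in [0.2, 1.8]\), and the counts used, are invented for the illustration:

```python
# Imprecise priors, as a sketch: a credal set of Beta(a, b) priors over a
# Bernoulli bias, with b = 2 - a and a ranging over [0.2, 1.8]. After
# observing n ones in t trials, the posterior predictive probability of
# the next one is (n + a)/(t + 2), so the credal set yields an interval.
def predictive(n, t, a):
    return (n + a) / (t + 2)

n, t = 6, 10                                     # invented sample counts
lo = min(predictive(n, t, a) for a in (0.2, 1.8))
hi = max(predictive(n, t, a) for a in (0.2, 1.8))
print(lo, hi)  # lower and upper predictive probabilities: 6.2/12 and 7.8/12
```

As the sample grows, \(n\) and \(t\) dominate the fixed spread in \(a\), so the interval narrows: the data progressively wash out the indeterminacy in the prior.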
These theoretical developments may look attractive, but the fact is that they mostly enjoy a cult status among philosophers of statistics and that they have not moved the statistician in the street. On the other hand, standard Bayesian statistics has seen a steep rise in popularity over the past decade or so, owing to the availability of good software and numerical approximation methods. And most of the practical use of Bayesian statistics is more or less insensitive to the potentially subjective aspects of the statistical results, employing uniform priors as a neutral starting point for the analysis and relying on the afore-mentioned convergence results to wash out the remaining subjectivity (cf. Gelman and Shalizi 2013). However, this practical attitude of scientists towards modelling should not be mistaken for a principled answer to the questions raised in the philosophy of statistics (see Morey et al 2013).
In the foregoing we have seen how classical and Bayesian statistics differ. But the two major approaches to statistics also have a lot in common. Most importantly, all statistical procedures rely on the assumption of a statistical model, here referring to any restricted set of statistical hypotheses. Moreover, they are both aimed at delivering a verdict over these hypotheses. For example, a classical likelihood ratio test considers two hypotheses, \(h\) and \(h'\), and then offers a verdict of rejection or acceptance, while a Bayesian comparison delivers a posterior probability over these two hypotheses. Whereas in Bayesian statistics the model presents a very strong assumption, classical statistics does not endow the model with a special epistemic status: the model simply collects the hypotheses currently entertained by the scientist. But across the board, the adoption of a model is absolutely central to any statistical procedure.
A natural question is whether anything can be said about the quality of the statistical model, and whether any verdict on this starting point for statistical procedures can be given. Surely some models will lead to better predictions, or be a better guide to the truth, than others. The evaluation of models touches on deep issues in the philosophy of science, because the statistical model often determines how the data-generating system under investigation is conceptualized and approached (Kieseppa 2001). Model choice thus resembles the choice of a theory, a conceptual scheme, or even of a whole paradigm, and thereby might seem to transcend the formal frameworks for studying theoretical rationality (cf. Carnap 1950, Jeffrey 1980). Despite the fact that some considerations on model choice will seem extra-statistical, in the sense that they fall outside the scope of statistical treatment, statistics offers several methods for approaching the choice of statistical models.
There are in fact very many methods for evaluating statistical models (Claeskens and Hjort 2008, Wagenmakers and Waldorp 2006). In the first instance, the methods occasion the comparison of statistical models, but very often they are used for selecting one model over the others. In what follows we only review prominent techniques that have led to philosophical debate: Akaike's information criterion, the Bayesian information criterion, and furthermore the computation of marginal likelihoods and posterior model probabilities, both associated with Bayesian model selection. We leave aside methods that use cross-validation, as they have, unduly, not received as much attention in the philosophical literature.
Akaike's information criterion, modestly termed An Information Criterion or AIC for short, is based on the classical statistical procedure of estimation (see Burnham and Anderson 2002, Kieseppa 1997). It starts from the idea that a model \(M\) can be judged by the estimate \(\hat{\theta}\) that it delivers, and more specifically by the proximity of this estimate to the distribution with which the data are actually generated, i.e., the true distribution. This proximity is often equated with the expected predictive accuracy of the estimate, because if the estimate and the true distribution are closer to each other, their predictions will be better aligned to one another as well. In the derivation of the AIC, the so-called relative entropy or Kullback-Leibler divergence of the two distributions is used as a measure of their proximity, and hence as a measure of the expected predictive accuracy of the estimate.
Naturally, the true distribution is not known to the statistician who is evaluating the model. If it were, then the whole statistical analysis would be useless. However, it turns out that we can give an unbiased estimation of the divergence between the true distribution and the distribution estimated from a particular model, \[ \text{AIC}[M] = - 2 \log P( s \mid h_{\hat{\theta}(s)} ) + 2 d , \] in which \(s\) is the sample data, \(\hat{\theta}(s)\) is the maximum likelihood estimate (MLE) of the model \(M\), and \(d = \dim(\Theta)\) is the number of dimensions of the parameter space of the model. The MLE of the model thereby features in an expression of the model quality, i.e., in a role that is conceptually distinct from the estimator function.
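The formula is straightforward to apply. The toy comparison below (the coin-flip data and the two candidate models are invented for the example) pits a zero-parameter model that fixes the Bernoulli bias at 1/2 against the full one-parameter Bernoulli model:

```python
import math

# AIC for two candidate models of 10 flips with 7 ones (invented data):
# M0 fixes theta = 1/2, so d = 0; M1 is the full Bernoulli model, whose
# MLE is thetahat = 7/10, so d = 1.
def aic(max_log_lik, d):
    """AIC[M] = -2 log P(s | h_thetahat(s)) + 2d."""
    return -2.0 * max_log_lik + 2 * d

n_ones, t = 7, 10
loglik_m0 = t * math.log(0.5)                                    # fixed theta
loglik_m1 = n_ones * math.log(0.7) + (t - n_ones) * math.log(0.3)  # at the MLE

aic_m0, aic_m1 = aic(loglik_m0, 0), aic(loglik_m1, 1)
print(aic_m0, aic_m1)  # M0 scores lower: its worse fit is outweighed by d = 0
```

Although \(M_1\) fits the sample strictly better, for these counts the penalty of its extra parameter tips the balance to the simpler model, which illustrates the trade-off discussed next.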
As can be seen from the expression above, a model with a smaller AIC is preferable: we want the fit to be optimal at little cost in complexity. Notice that the number of dimensions, or independent parameters, in the model increases the AIC and thereby lowers the eligibility of the model: if two models achieve the same maximum likelihood for the sample, then the model with fewer parameters will be preferred. For this reason, statistical model selection by the AIC can be seen as an independent motivation for preferring simple models over more complex ones (Sober and Forster 1994). But this result also invites some critical remarks. For one, we might impose other criteria than merely unbiasedness on the estimation of the proximity to the truth, and this will lead to different expressions for the approximation. Moreover, it is not always clear-cut what the dimensions of the model under scrutiny really are. For curve fitting this may seem simple, but for more complicated models or different conceptualizations of the space of models, things do not look so easy (cf. Myung et al 2001, Kieseppa 2001).
A prime example of model selection is presented in curve fitting. Given a sample \(s\) consisting of a set of points in the plane \((x, y)\), we are asked to choose the curve that fits these data best. We assume that the models under consideration are of the form \(y = f(x) + \epsilon\), where \(\epsilon\) is a normal distribution with mean 0 and a fixed standard deviation, and where \(f\) is a polynomial function. Different models are characterized by polynomials of different degrees that have different numbers of parameters. Estimations fix the parameters of these polynomials. For example, for the 0-degree polynomial \(f(x) = c_{0}\) we estimate the constant \(\hat{c_{0}}\) for which the probability of the data is maximal, and for the 1-degree polynomial \(f(x) = c_{0} + c_{1}\, x\) we estimate the slope \(\hat{c_{1}}\) and the offset \(\hat{c_{0}}\). Now notice that for a total of \(n\) points, we can always find a polynomial of degree \(n - 1\) that passes through all points exactly, resulting in a comparatively high maximum likelihood \(P(s \mid \{\hat{c_{0}}, \ldots, \hat{c_{n-1}} \})\). Applying the AIC, however, we will typically find that some model with a polynomial of degree \(k < n\) is preferable. Although \(P(s \mid \{\hat{c_{0}}, \ldots, \hat{c_{k}} \})\) will be somewhat lower, this is compensated for in the AIC by the smaller number of parameters.
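The curve fitting example can be run end to end. In the sketch below the data are generated from a line with small fixed perturbations standing in for noise (the particular numbers are invented), and \(\sigma\) is treated as known and fixed, as in the setup above:

```python
import numpy as np

# Polynomial curve fitting scored by the AIC: higher-degree polynomials
# fit the sample better, but each extra coefficient costs 2 AIC points.
sigma = 0.5
x = np.arange(8, dtype=float)
noise = np.array([0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.05, -0.15])
y = 2 * x + 1 + noise                          # true curve: f(x) = 1 + 2x

def aic_for_degree(k):
    coeffs = np.polyfit(x, y, k)               # maximum likelihood fit
    rss = float(((np.polyval(coeffs, x) - y) ** 2).sum())
    n = len(x)
    log_lik = -n / 2 * np.log(2 * np.pi * sigma**2) - rss / (2 * sigma**2)
    return -2 * log_lik + 2 * (k + 1)          # k + 1 free coefficients

aics = {k: aic_for_degree(k) for k in range(6)}
best = min(aics, key=aics.get)
print(best)  # the degree-1 model wins
```

Degrees above 1 reduce the residuals only slightly, so their improved likelihood never repays the 2-point penalty per extra coefficient, while degree 0 fits so poorly that its low parameter count cannot save it.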
Various other prominent model selection tools are based on methods from Bayesian statistics. They all start from the idea that the quality of a model is expressed in the performance of the model on the sample data: the model that, on the whole, makes the sample data most probable is to be preferred. Because of this, there is a close connection with the hierarchical Bayesian modelling referred to earlier (Gelman 2013). The central notion in the Bayesian model selection tools is thus the marginal likelihood of the model, i.e., the weighted average of the likelihoods over the model, using the prior distribution as a weighting function: \[ P(s \mid M_{i}) \; = \; \int_{\theta \in \Theta_{i}} P(h_{\theta}) P(s \mid h_{\theta}) d\theta . \] Here \(\Theta_{i}\) is the parameter space belonging to model \(M_{i}\). The marginal likelihoods can be combined with a prior probability over models, \(P(M_{i})\), to derive the so-called posterior model probability, using Bayes' theorem. One way of evaluating models, known as Bayesian model selection, is by comparing the models on their marginal likelihood, or else on their posteriors (cf. Kass and Raftery 1995).
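For the Bernoulli model the marginal likelihood integral has a closed form, so a numerical sketch can be checked against it (the counts and the uniform prior are illustrative choices):

```python
import math
import numpy as np

# Marginal likelihood of the Bernoulli model under a uniform prior,
# approximated by a grid sum over Theta = [0, 1], for a sample with
# n ones in t trials (one fixed ordering of the bits).
n, t = 3, 10
theta = np.linspace(0.0, 1.0, 100001)
likelihood = theta**n * (1 - theta)**(t - n)   # P(s | h_theta)
h = theta[1] - theta[0]
marginal = likelihood.sum() * h                # uniform prior has density 1

# Closed form via the Beta function: B(n+1, t-n+1) = n!(t-n)!/(t+1)!
exact = math.factorial(n) * math.factorial(t - n) / math.factorial(t + 1)
print(marginal, exact)  # numerical and analytic values agree
```

The grid sum is a stand-in for the integral in the displayed formula; for richer models the integral is rarely tractable, which motivates the approximation discussed next.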
Usually the marginal likelihood cannot be computed analytically. Numerical approximations can often be obtained, but for practical purposes it has proved very useful, and quite sufficient, to employ an approximation of the marginal likelihood. This approximation has become known as the Bayesian information criterion, or BIC for short (Schwarz 1978, Raftery 1995). It turns out that this approximation shows remarkable similarities to the AIC: \[ \text{BIC}[M] \; = \; - 2 \log P(s \mid h_{\hat{\theta}(s)}) + d \log n . \] Here \(\hat{\theta}(s)\) is again the maximum likelihood estimate of the model, \(d = \dim(\Theta)\) the number of independent parameters, and \(n\) is the number of data points in the sample. The dependence on \(n\) is the only difference from the AIC, but it can make a major difference to how the model evaluation turns out.
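The difference between the two penalty terms is easy to quantify; the log-likelihood value and parameter counts below are invented for the comparison:

```python
import math

# AIC penalizes each parameter by 2; BIC penalizes it by log n, so for
# n >= 8 the BIC punishes extra parameters more heavily than the AIC.
def aic(max_log_lik, d):
    return -2 * max_log_lik + 2 * d

def bic(max_log_lik, d, n):
    """BIC[M] = -2 log P(s | h_thetahat(s)) + d log n."""
    return -2 * max_log_lik + d * math.log(n)

gap = bic(-5.0, 3, 100) - aic(-5.0, 3)
print(gap)  # 3 * (log 100 - 2): the same model scores worse under BIC
```

With \(n = 100\) each of the three parameters costs \(\log 100 \approx 4.6\) under the BIC against 2 under the AIC, so in large samples the BIC tends to select sparser models.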
The concurrence of the AIC and the BIC seems to give a further motivation for our intuitive preference for simple models over more complex ones. Indeed, other model selection tools, like the deviance information criterion (Spiegelhalter et al 2002) and the approach based on minimum description length (Grunwald 2007), also result in expressions that feature a term that penalizes complex models. However, this is not to say that the dimension term that we know from the information criteria exhausts the notion of model complexity. There is ongoing debate in the philosophy of science concerning the merits of model selection in explications of the notion of simplicity, informativeness, and the like (see, for example, Sober 2004, Romeijn and van de Schoot 2008, Romeijn et al 2012, Sprenger 2013).
There are also statistical methods that refrain from the use of a particular model, by focusing exclusively on the data or by generalizing over all possible models. Some of these techniques are properly localized in descriptive statistics: they do not concern an inference from data but merely serve to describe the data in a particular way. Statistical methods that do not rely on an explicit model choice have unfortunately not attracted much attention in the philosophy of statistics, but for completeness' sake they will be briefly discussed here.
One set of methods, and a quite important one for many practicing statisticians, is aimed at data reduction. Often the sample data are very rich, e.g., consisting of a set of points in a space of very many dimensions. The first step in a statistical analysis may then be to pick out the salient variability in the data, in order to scale down the computational burden of the analysis itself.
The technique of principal component analysis (PCA) is designed for this purpose (Jolliffe 2002). Given a set of points in a space, it seeks out the set of vectors along which the variation in the points is large. As an example, consider two points in a plane parameterized as \((x, y)\): the points \((0, 0)\) and \((1, 1)\). In the \(x\)-direction and in the \(y\)-direction the variation is \(1\), but over the diagonal the variation is maximal, namely \(\sqrt{2}\). The vector on the diagonal is called the principal component of the data. In richer data structures, and using a more general measure of variation among points, we can find the first component in a similar way. Moreover, we can repeat the procedure after subtracting the variation along the last found component, by projecting the data onto the plane perpendicular to that component. This allows us to build up a set of principal components of diminishing importance.
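The two-point example can be reproduced with a standard eigendecomposition of the covariance matrix, which is one common way of computing principal components:

```python
import numpy as np

# PCA on the two-point example above: for the points (0, 0) and (1, 1),
# the first principal component lies along the diagonal.
points = np.array([[0.0, 0.0], [1.0, 1.0]])
centered = points - points.mean(axis=0)        # center the data
cov = centered.T @ centered / len(points)      # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                           # direction of maximal variation
print(pc1)  # proportional to (1, 1), up to sign: the diagonal
```

The repeated projection described in the text corresponds to reading off the remaining eigenvectors in order of decreasing eigenvalue, each accounting for the variation left over by the previous components.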
PCA is only one item from a large collection of techniques that are aimed at keeping the data manageable and finding patterns in it, a collection that also includes kernel methods and support vector machines (e.g., Vapnik and Kotz 2006). For present purposes, it is important to stress that such tools should not be confused with statistical analysis: they do not involve the testing or evaluation of distributions over sample space, even though they build up and evaluate models of the data. This sets them apart from, e.g., confirmatory and exploratory factor analysis (Bartholomew 2008), which is sometimes taken to be a close relative of PCA because both sets of techniques allow us to identify salient dimensions within sample space, along which the data show large variation.
Practicing statisticians often employ data reduction tools to arrive at conclusions on the distributions from which the data were sampled. There is already a wide use for machine learning and data mining techniques in the sciences, and we may expect even more usage of these techniques in the future, because so much data is now becoming available for scientific analysis. However, in the philosophy of statistics there is as yet little debate over the epistemic status of conclusions reached by means of these techniques. Philosophers of statistics would do well to direct some attention here.
An entirely different approach to statistics is presented by formal learning theory. This is again a vast area of research, primarily located in computer science and artificial intelligence. The discipline is here mentioned briefly, as another example of an approach to statistics that avoids the choice of a statistical model altogether and merely identifies patterns in the data. We leave aside the theory of neural networks, which also concerns predictive systems that do not rely on a statistical model, and focus on the theory of learning algorithms because, of all these approaches, they have received the most philosophical attention.
Pioneering work on formal learning was done by Solomonoff (1964). As before, the setting is one in which the data consist of strings of 0's and 1's, and in which an agent is attempting to identify the pattern in these data. So, for example, the data may be a string of the form \(0101010101\ldots\), and the challenge is to identify this string as an alternating sequence. The central idea of Solomonoff is that all possible computable patterns must be considered by the agent, and therefore that no restrictive choice of statistical hypotheses is warranted. Solomonoff then defined a formal system in which indeed all patterns can be taken into consideration, effectively using a Bayesian analysis with a cleverly constructed prior over all computable hypotheses.
This general idea can also be identified in a rather new field at the intersection of Bayesian statistics and machine learning, Bayesian nonparametrics (e.g., Orbanz and Teh 2010, Hjort et al 2010). Rather than specifying, at the outset, a confined set of distributions from which a statistical analysis is supposed to choose on the basis of the data, the idea is that the data are confronted with a potentially infinite-dimensional space of possible distributions. The set of distributions taken into consideration is then made relative to the data obtained: the complexity of the model grows with the sample. The result is a predictive system that performs an online model selection alongside a Bayesian accommodation of the posterior over the model.
Current formal learning theory is a lively field, to which philosophers of statistics also contribute (e.g., Kelly 1996, Kelly et al 1997). Particularly salient for the present concerns is that the systems of formal learning are set up to achieve some notion of adequate universal prediction, without confining themselves to a specific set of hypotheses, and hence by imposing minimal constraints on the set of possible patterns in the data. It is a matter of debate whether this is at all possible, and to what extent the predictions of formal learning theory thereby rely on, e.g., implicit assumptions on the structure of the sample space. Philosophical reflection on this is only in its infancy.
There are numerous topics in the philosophy of science that bear direct relevance to the themes covered in this lemma. A few central topics are mentioned here to direct the reader to related lemmas in the encyclopedia.
One very important topic that is immediately adjacent to the philosophy of statistics is confirmation theory, the philosophical theory that describes and justifies relations between scientific theory and empirical evidence. Arguably, the theory of statistics is a proper part of confirmation theory, as it describes and justifies the relation that obtains between statistical theory and evidence in the form of samples. It can be insightful to place statistical procedures in this wider framework of relations between evidence and theory. Zooming out even further, the philosophy of statistics is part of the philosophical topic of methodology, i.e., the general theory on whether and how science acquires knowledge. Thus conceived, statistics is one component in a large collection of scientific methods comprising concept formation, experimental design, manipulation and observation, confirmation, revision, and theorizing.
There are also a fair number of specific topics from the philosophy of science that are spelled out in terms of statistics or that are located in close proximity to it. One of these topics is the process of measurement, in particular the measurement of latent variables on the basis of statistical facts about manifest variables. The so-called representational theory of measurement (Kranz et al 1971) relies on statistics, in particular on factor analysis, to provide a conceptual clarification of how mathematical structures represent empirical phenomena. Another important topic from the philosophy of science is causation (see the entries on probabilistic causation and Reichenbach's common cause principle). Philosophers have employed probability theory to capture causal relations ever since Reichenbach (1956), but more recent work in causality and statistics (e.g., Spirtes et al 2001) has given the theory of probabilistic causality an enormous impulse. Here again, statistics provides a basis for the conceptual analysis of causal relations.
And there is so much more. Several specific statistical techniques, like factor analysis and the theory of Bayesian networks, invite conceptual discussion of their own accord. Numerous topics within the philosophy of science lend themselves to statistical elucidation, e.g., the coherence, informativeness, and surprise of evidence. And in turn there is a wide range of discussions in the philosophy of science that inform a proper understanding of statistics. Among them are debates over experimentation and intervention, concepts of chance, the nature of scientific models, and theoretical terms. The reader is invited to consult the entries on these topics to find further indications of how they relate to the philosophy of statistics.
belief, formal representations of | causation: probabilistic | confirmation | defaults in semantics and pragmatics | evidence | induction: problem of | learning theory, formal | logic: and probability | logic: inductive | probability, interpretations of | reasoning: defeasible | Reichenbach, Hans: common cause principle | scientific method | simplicity | skepticism: ancient
The Stanford Encyclopedia of Philosophy is copyright © 2016 by The Metaphysics Research Lab, Center for the Study of Language and Information (CSLI), Stanford University
Library of Congress Catalog Data: ISSN 1095-5054