Stanford Encyclopedia of Philosophy

Philosophy of Statistics

First published Tue Aug 19, 2014; substantive revision Wed Oct 1, 2025

Statistics investigates and develops specific methods for evaluating hypotheses in the light of empirical facts. A method is called statistical, and thus the subject of study in statistics, if it relates facts and hypotheses of a particular kind: the empirical facts must be codified and structured into data sets, and the hypotheses must be formulated in terms of probability distributions over possible data sets. The philosophy of statistics concerns the foundations and the proper interpretation of statistical methods, their input, and their results. Since statistics is relied upon in almost all empirical scientific research, serving to support and communicate scientific findings, the philosophy of statistics is of key importance to the philosophy of science. It is part of the philosophical appraisal of scientific method, and it impacts the debate over the epistemic and ontological status of scientific theory.

The philosophy of statistics harbors a large variety of topics and debates. Central to these is the problem of induction, which concerns the justification of inferences or procedures that extrapolate from data to predictions and general facts. Further debates concern the interpretation of the probabilities that are used in statistics, and the wider theoretical framework that may ground and justify the correctness of statistical methods. A general introduction to these themes is given in Section 1 and Section 2. Section 3 and Section 4 provide an account of how these themes play out in the two major theories of statistical method, classical and Bayesian statistics respectively. Section 5 directs attention to the notion of a statistical model, covering model selection and simplicity, but also discussing statistical and data-scientific techniques that do not rely on statistical models. Section 6 briefly mentions relations between the philosophy of statistics and several other themes from the philosophy of science, including confirmation theory, evidence, causality, measurement, and scientific methodology in general.


1. Statistics and induction

Statistics is a mathematical and conceptual discipline that focuses on the relation between data and hypotheses. The data are recordings of observations or events in a scientific study, e.g., a set of measurements of individuals from a population. The data actually obtained are variously called the sample, the sample data, or simply the data, and all possible samples from a study are collected in what is called a sample space. The hypotheses, in turn, are general statements about the target system of the scientific study, e.g., expressing some general fact about all individuals in the population. A statistical hypothesis is a general statement that can be expressed by a probability distribution over sample space, i.e., it determines a probability for each of the possible samples.

Statistical methods provide the mathematical and conceptual means to evaluate statistical hypotheses in the light of a sample. To this end the methods employ probability theory, and occasionally generalizations thereof. The evaluations may determine how believable a hypothesis is, whether we may rely on the hypothesis in our decisions, how strong the support is that the sample gives to the hypothesis, and so on. Good introductions to statistics abound (e.g., Mood and Graybill 1974, Barnett 1999, Press 2002, Wasserman 2004, Gelman et al. 2013).

We can set the stage with an example, adapted from Fisher (1935) with a nod to Student’s t-test.

The tea tasting student
Consider a student who claims that they can, by taste, determine the order in which milk and tea were poured into the cup. Now imagine that we prepare five cups of tea, tossing a fair coin to determine the order of milk and tea in each cup. We ask the student to pronounce the order, and we find that she is correct in all cases! Now if she is guessing the order blindly then, owing to the random way we prepare the cups, she will answer correctly 50% of the time. This is our statistical hypothesis, referred to as the null hypothesis. It gives a probability of \(1/2\) to a correct guess and hence a probability of \(1/2\) to an incorrect one. The sample space consists of all sequences of answers the student might give, i.e., all sequences of correct and incorrect guesses. But our actual data sits in a rather special corner in this space. On the assumption of our statistical hypothesis, the probability of the recorded events is a mere 3%, or \(1/2^{5}\) more precisely. On this ground, we may decide to reject the hypothesis that the student is guessing: the probability of the event of five correct guesses is too low to retain it.
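The arithmetic behind the example can be checked directly; a minimal sketch in Python:

```python
# Probability of the observed sample under the null hypothesis that
# each of the five guesses is an independent fair-coin toss.
n_cups = 5
p_correct = 0.5              # chance of a correct guess when guessing blindly

p_all_correct = p_correct ** n_cups
print(p_all_correct)         # 0.03125, i.e., roughly 3%
print(p_all_correct < 0.05)  # True: below the conventional 5% limit
```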

According to the so-called null hypothesis test, such a decision is warranted if the data actually obtained are included in a particular region within sample space, whose total probability does not exceed some specified limit, standardly set at 5%. Now consider what is achieved by the statistical test just outlined. We started with a hypothesis on the actual tea tasting abilities of the student, namely, that she did not have any. On the assumption of this hypothesis, the sample data we obtained appears to be very surprising or, more precisely, highly improbable. This motivated us to reject the hypothesis that the student has no tea tasting abilities whatsoever. The sample thus points us to a negative but general conclusion about what the student can, or cannot, do.

Notably, all individual sequences of correct and incorrect guesses are equally improbable according to the null hypothesis: they all have a probability of \(1/2^5\). But when we collect the possible samples in sets that have the same total number of correct guesses, the singleton set of five correct guesses presents itself as an outlier. It contains only one sequence and it is therefore less probable than the set containing all five sequences with a total of four correct guesses, and far less probable than the sets containing two or three correct guesses, which each consist of ten sequences. Marking out the one sample of five correct guesses as the improbable event thus hinges on a specific way of labelling the samples. The function on sample space that determines this labelling is usually the so-called sufficient statistic, i.e., the function that groups sequences that are equiprobable according to the statistical hypotheses under scrutiny.
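The grouping by the sufficient statistic can be made concrete; a short sketch, using the binomial counts for the five guesses:

```python
from math import comb

# Group the 2**5 equiprobable sequences by the sufficient statistic:
# the total number k of correct guesses. Each group contains C(5, k)
# sequences, each of probability (1/2)**5.
n = 5
for k in range(n + 1):
    count = comb(n, k)       # sequences with exactly k correct guesses
    prob = count * 0.5 ** n  # total probability of the group
    print(k, count, prob)
# The singleton group k = 5 has probability 1/32, whereas the groups
# with two or three correct guesses each contain ten sequences.
```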

The basic pattern of a statistical analysis is familiar from inductive inference: we input the data obtained thus far, and the statistical procedure outputs a verdict or evaluation that transcends the data, i.e., a statement that is not entailed by the data alone. If the data are indeed considered to be the only input, and if the statistical procedure is understood as an inference, then statistics is concerned with ampliative inference: roughly speaking, we get out more than we have put in. And since the ampliative inferences of statistics pertain to future or general states of affairs, they are inductive. However, the association of statistics with ampliative and inductive inference is contested, both because statistics is considered to be non-inferential by some (see Section 3) and non-ampliative by others (see Section 4).

Despite such disagreements, it is insightful to view statistics as a response to the problem of induction (cf. Howson 2000 and the entry on the problem of induction). This problem, first discussed by Hume in his Treatise of Human Nature (Book I, part 3, section 6) but prefigured already by ancient sceptics like Sextus Empiricus (see the entry on ancient skepticism), is that there is no proper justification for inferences that run from given experience to expectations about the future. Transposed to the context of statistics, it reads that there is no proper justification for procedures that take data as input and that return a verdict, an evaluation, or some other piece of advice that pertains to the future, or to general states of affairs. Arguably, much of the philosophy of statistics is about coping with this challenge, by providing a foundation of the procedures that statistics offers, or else by reinterpreting what statistics delivers so as to evade the challenge.

It is debatable if philosophers of statistics are ultimately concerned with the delicate, somewhat ethereal issue of the justification of induction, and statisticians generally are not. Many philosophers and scientists simply accept the fallibility of statistics, and find it more important that statistical methods are understood and applied correctly. As is so often the case, the fundamental philosophical problem serves as a catalyst: the problem of induction guides our investigations into the workings, the correctness, and the conditions of applicability of statistical methods. The philosophy of statistics, understood as the general header under which these investigations are carried out, is not, or not primarily, concerned with philosophy for its own sake. Rather it presents a concrete contribution to the scientific method, and hence to science itself. Considering the centrality of statistical methods in practically every empirical science and the ongoing debates over their validity, for instance over the reproducibility of experimental findings in social and medical science (Ioannidis 2005), this kind of applied philosophical work is of vital importance.

2. Foundations and interpretations

While there is large variation in how statistical procedures and inferences are organized, they all agree on the use of modern measure-theoretic probability theory (Kolmogorov 1933), or a near kin, as the means to express hypotheses and relate them to data. By itself, a probability function is simply a particular kind of mathematical function, used to express the size or measure of a set (cf. Billingsley 1995).

Let \(W\) be a set with elements \(s\), and consider an initial collection of subsets of \(W\), e.g., the singleton sets \(\{ s \}\). Now consider the operation of taking the complement \(\bar{R}\) of a given set \(R\): this complement \(\bar{R}\) contains all and only those \(s\) that are not included in \(R\). Next consider the join \(R \cup Q\) of given sets \(R\) and \(Q\): an element \(s\) is a member of \(R \cup Q\) precisely when it is a member of \(R\), or of \(Q\), or of both. The collection of sets generated by combining and iterating the operations of complement and join is called an algebra, denoted \(S\). In statistics we interpret \(W\) as the set of samples, and we can associate sets \(R\) with specific events or observations. A specific sample \(s\) includes a record of the event denoted with \(R\) exactly when \(s \in R\). We take the algebra of sets like \(R\) as a language for making claims about the samples.

A probability function is defined as an additive normalized measure onthe algebra: a function

\[ P: {\cal S} \rightarrow [0, 1] \]

such that \(P(W) = 1\) and \(P(R \cup Q) = P(R) + P(Q)\) if \(R \cap Q = \emptyset\). The conditional probability \(P(Q \mid R)\) is defined as

\[ P(Q \mid R) \; = \; \frac{P(Q \cap R)}{P(R)} , \]

whenever \(P(R) > 0\). It determines the relative size of the set \(Q\) within the set \(R\). It is often read as: the probability of the event \(Q\) given that the event \(R\) occurs. Recall that the set \(R\) consists of all samples \(s\) that include a record of the event associated with \(R\). By looking at \(P(Q \mid R)\) we zoom in on the probability function within this set \(R\), i.e., we consider the condition that the associated event occurs.
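These definitions can be checked in a toy model; a minimal sketch in which \(W\) contains four equiprobable samples and the events \(R\) and \(Q\) are chosen purely for illustration:

```python
from fractions import Fraction

# Toy sample space W with four equiprobable samples; each two-character
# string stands for one sample, and events are sets of samples.
W = {"ab", "aB", "Ab", "AB"}
P = {s: Fraction(1, 4) for s in W}

def prob(event):
    """Probability of an event: the summed measure of its samples."""
    return sum(P[s] for s in event)

R = {s for s in W if s[0] == "a"}   # event: first component is "a"
Q = {s for s in W if s[1] == "B"}   # event: second component is "B"

# Additivity on disjoint parts, and the conditional probability P(Q | R)
assert prob(R | Q) == prob(R) + prob(Q) - prob(R & Q)
cond = prob(Q & R) / prob(R)
print(cond)  # 1/2: the relative size of Q within R
```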

What does the probability function mean? The mathematical notion of probability does not provide a complete answer. The function \(P\) may be interpreted as

  1. ontic, namely the frequency or propensity of the occurrence of a state of affairs, often referred to as the chance, or else as
  2. epistemic, namely the degree of belief in the occurrence of the state of affairs, the willingness to act on its assumption, a degree of support or confirmation, or similar, often referred to as the credence.

This distinction should not be confused with that between objective and subjective probability. Both ontic and epistemic probability can be given an objective and subjective character, in the sense that both can be taken as dependent on, or independent of, a knowing subject and her conceptual apparatus. For more details on the interpretation of probability, the reader may consult a large literature, including Galavotti (2005), Gillies (2000), Mellor (2005), von Plato (1994), the anthology by Eagle (2010), the handbook of Hajek and Hitchcock (2016), or indeed the entry on interpretations of probability. In this context the key point is that the interpretations can all be connected to foundational programmes for statistical procedures. Although the match is far from exact, the two major types specified above can be associated with the two major theories of statistics, classical and Bayesian statistics, respectively.

2.1 Chance and classical statistics

In the sciences, the idea that probabilities express states of affairs, i.e., properties of chance setups or stochastic processes, is relatively prominent. The probabilities express relative frequencies in series of events or, alternatively, tendencies or propensities in the systems that realize those events. More precisely, the probability attached to the property of an event type can be understood as the frequency or tendency with which that property manifests in a series of events of that type. For instance, the probability of a coin landing heads is a half exactly when in a series of similar coin tosses, the coin lands heads half the time. Or alternatively, the probability is half if there is an even tendency towards both possible outcomes in the setup of the coin tossing. The mathematician Venn (1888) and scientists like Quetelet and Maxwell (cf. von Plato 1994) are early proponents of this way of viewing probability, although the basic conception of chance goes as far back as Huygens (1657). Philosophical theories of propensities were first proposed by Peirce (1910), and developed by Popper (1959), Mellor (1971), Giere (1976), and Bigelow (1977); see Handfield (2012) for an overview. A rigorous theory of probability as frequency was first devised by von Mises (1981), also defended by Reichenbach (1938) and beautifully expounded in van Lambalgen (1987). A highly readable overview of ideas about chance is offered by Diaconis and Skyrms (2018).

This ontic conception of probability is connected to one of the major theories of statistical method, which is often called classical statistics but which, more neutrally, might be termed Bernoullian, in reference to its early appearance in Bernoulli’s Ars Conjectandi (1713). It was developed roughly in the first half of the 20th century, mostly by mathematicians and working scientists like Fisher (1925, 1935, 1956), Wald (1939, 1950) and Neyman and Pearson (1928, 1933, 1967), and refined by very many statisticians of the last few decades. The key characteristic of this theory of statistics aligns naturally with viewing probabilities as chances, hence pertaining to observable and repeatable events. Ontic probability cannot meaningfully be attributed to statistical hypotheses, since hypotheses do not have tendencies to occur, and hence no frequencies with which they come about: they are categorically true or false, once and for all. Attributing probability to a hypothesis entails that the probability is understood epistemically.

Classical or Bernoullian statistics is often called frequentist, owing to the centrality of frequencies of events in its procedures and the prominence of the frequentist interpretation of probability developed by von Mises. In this interpretation, chances are identified with frequencies, or proportions in a class of similar events or items. They are best thought of as analogous to other physical quantities, like mass and energy. It deserves emphasis that frequencies are thus conceptually prior to chances. In propensity theory the probability of an individual event or item is viewed as a tendency in nature, so that the frequencies, or the proportions in a class of similar events or items, manifest as a consequence of the law of large numbers. In the frequentist theory, by contrast, the proportions lay down, indeed define, what the chances are. This leads to a central problem for frequentist probability, the so-called reference class problem: it is not clear what class to associate with an individual event or item (cf. Reichenbach 1949, Hajek 2007). One may argue that the class needs to be as narrow as it can be, offering a maximally precise description of the event type at issue. But in the extreme case of a singleton class of events, the chances of course trivialize to zero or one. Since classical statistics employs non-trivial probabilities that attach to the single case in its procedures, a fully frequentist understanding of statistics is arguably in need of a response to the reference class problem.

To illustrate ontic probability in classical statistics, we briefly consider the frequentist interpretation in the example of the tea tasting student.

Frequentist interpretation
We denote the null hypothesis that the student is merely guessing by \(h\). Say that we follow the rule indicated in the example above: we reject the null hypothesis, i.e., deny that the student is merely guessing, whenever the sampled data \(s\) is included in a particular set \(R\) of possible samples, i.e., when \(s \in R\). Furthermore the set of samples \(R\) has a summed probability of 3% according to the null hypothesis. Now imagine that we are supposed to judge a whole population of tea tasting students, scattered in tea rooms throughout the country. Then, by running the experiment in a large number of tea rooms and adopting the rule just cited, we know that we will falsely attribute special tea tasting talents to 3% of those students for whom the null hypothesis is true, i.e., who are in fact merely guessing. Alternatively, we may imagine a single student without special tea tasting talents, being tested by a population of scientists who all use the same null hypothesis test. In that case 3% of scientists will falsely attribute special tea tasting talents to the student. Either way, the percentage pertains to a frequency within a particular set of events, which by the rule is connected to a particular error in judgment.

Say that we have found a student for whom we reject the null hypothesis, i.e., a student who passes the test. Does she have the tea tasting ability or not? Unfortunately this is not the sort of question that can be answered by the test at hand. A good answer would presumably involve the proportion of students who indeed have the special tea tasting ability among those whose scores exceeded a certain threshold, e.g., those who answered correctly on all five cups. But this latter proportion, namely of students for whom the null hypothesis is false among all those students who passed the test, cannot be determined without further assumptions. It will depend also on the proportion of students who have the ability in the population on the whole: if there are many of them around, it is more probable that one who passed the test indeed has the tea tasting talents. But the null hypothesis test only involves proportions within a group of students for whom the null hypothesis is assumed to be true. This holds in general for frequentist statistics: it only considers probabilities for events, or chances for short, under the assumption that the events are distributed in a given way, and it only involves the observable consequences of these chances, namely the frequencies.
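The point can be made numerically. In the sketch below, the base rates and the 3/4 success probability for an able student are hypothetical assumptions, introduced only to show how the sought-after proportion varies with the prevalence of the ability:

```python
# How the proportion of genuinely able students among those who pass
# depends on the base rate of the ability in the population.
# The base rates and the 3/4 success probability are illustrative
# assumptions, not given by the test itself.
p_pass_given_null = 0.5 ** 5    # a mere guesser answers all five correctly
p_pass_given_able = 0.75 ** 5   # an able student, assumed right 3/4 of the time

for base_rate in (0.01, 0.1, 0.5):
    p_pass = (base_rate * p_pass_given_able
              + (1 - base_rate) * p_pass_given_null)
    p_able_given_pass = base_rate * p_pass_given_able / p_pass
    print(base_rate, round(p_able_given_pass, 3))
# The larger the base rate, the more probable it is that a student
# who passed the test indeed has the ability.
```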

2.2 Credence and statistical inference

There is an alternative way of viewing the probabilities that appear in statistical methods: they can be seen as expressions of epistemic attitudes, or credences for short. We are again facing several interrelated options.

2.2.1 Types of credence

Credences are often taken as doxastic in the sense that they specify opinions about data and hypotheses of an idealized rational agent. They express the strength or degree of belief, for instance regarding the correctness of the next guess of the tea tasting student. Credences are often framed in a decision-theoretic way, namely as part of a more elaborate representation of the agent’s dispositions towards decisions and actions. Such a decision-theoretic representation involves credences alongside preferential attitudes and other furnishing of inner life. The norms for credences accordingly derive from requirements on how credences guide our actions, often spelled out in a specific pragmatist way. Credences are taken to express a willingness to engage in collections of bets: the credence in the occurrence of an event is given by the price of a betting contract that pays out one monetary unit if the event manifests. So-called Dutch book arguments then constrain credences: the requirement that the agent does not expose herself to sure loss entails that the credences must comply with the axioms of probability theory (cf. Jeffrey 1992).

There are alternatives to this pragmatist take on doxastic credence. Doxastic credences might instead pertain to beliefs in a stand-alone fashion, separate from decisions or actions. In this case the norms for the credences can be derived from the requirement that the beliefs have to be accurate, i.e., close to the truth values for the occurrence or non-occurrence of the events under scrutiny. By relying on specific assumptions about proximity to truth values, we can provide so-called non-pragmatist vindications of the probability axioms for credences. An early argument of this kind can be found in de Finetti (1974) but further discussions and extensions are proposed by for instance Joyce (1999) and Leitgeb and Pettigrew (2010a and 2010b).

Within the doxastic conception of credence we can make a further subdivision into subjective and objective doxastic credences. The defining characteristic of an objective doxastic credence is that it is constrained by further rationality criteria, or else by the demand that the beliefs are calibrated to an objective fact or state of affairs. A prominent rationality criterion is that equal possibilities receive equal credence, known as the principle of indifference. A well-known calibration requirement, expressed in a variety of so-called chance-credence principles, is that the credences align with frequencies of events, or chance ascriptions to events. A subjective doxastic attitude, by contrast, is not constrained in such a way: agents are free to believe as they see fit, as long as they comply with the probability axioms.

Credences may also be taken as logical. More precisely, probability theory itself may be taken as a kind of logic, i.e., a formal structure that describes valid inference. In this logical approach probability values over data and hypotheses have a role that is comparable to the role of truth values in Boolean logic. The axioms of probability impose coherence constraints on probability values, much like the rules of logic impose constraints on truth valuations. Accordingly, we can derive the norms for credences from an independent conception of valid inference. In particular, compliance with the probability axioms can be derived from natural desiderata for graded beliefs (e.g., Cox 1961 and Howson 2000). The logical conception of credence is distinct from the doxastic one because the logical constructions do not carry any reference to a psychological reality, in the same way that logic does not represent thought.

The epistemic view on probability has a long history, starting in early work by Pascal (cf. Hacking 2006). It was substantially developed in the 19th and the first half of the 20th century, first by the hand of De Morgan (1847) and Boole (1854), later by Keynes (1921), Ramsey (1926) and de Finetti (1937), and by decision theorists, philosophers and inductive logicians such as Carnap (1950), Savage (1962), Levi (1980), and Jeffrey (1992). Important proponents of these views in statistics were Jeffreys (1961), Edwards (1972), Lindley (1965), Good (1983), Jaynes (2003) as well as Bayesian philosophers and statisticians of the last few decades (Dawid 2004, Berger 2006, Goldstein 2006, Kadane 2011 among many others). All of these have a view that places probabilities somewhere in the realm of the epistemic rather than the ontic, i.e., not as part of the world itself but rather as a means to model our beliefs about the world.

2.2.2 Statistical inference

For present concerns the important point is that each of these epistemic interpretations of the probability calculus comes with its own set of foundational programs for statistics. On the whole, epistemic probability is most naturally associated with the second major theory of statistical methods, called Bayesian statistics (Press 2002, Berger 2006, Gelman et al. 2013), in reference to its inventor Thomas Bayes. The key characteristic of Bayesian statistics flows directly from the epistemic interpretation: under this interpretation it makes sense to assign probability to a statistical hypothesis, as an expression of how strongly we believe the hypothesis. Bayesian statistics allows us to relate credences over statistical hypotheses to the chances of events, so that we can express how our credences over statistical hypotheses change under the impact of observations. The result is a formal representation of statistical inference, i.e., a framework for deriving credences over hypotheses in the light of data.

To illustrate the epistemic conception of probability in Bayesian statistics, we return to the example of the tea tasting student.

Bayesian inference
As before we denote the null hypothesis that the student is guessing randomly with \(h\). The distribution \(P_{h}\) assigns a probability of 1/2 to any guess made by the student. The alternative \(h'\) is that the student performs better than a fair coin. To keep matters simple, we might stipulate that the distribution \(P_{h'}\) gives a probability of 3/4 to a correct guess. At the outset we might find it rather improbable that the student has special tea tasting abilities. To express this we give the hypothesis of her having these abilities only half the probability that we allocate to her not having any such abilities: \(P(h') = 1/3\) and \(P(h) = 2/3\). Notice that we need an epistemic conception of probability to make sense of this last step: we assign a credence to the hypotheses, which themselves impose chances over the events, or over the samples representing them. Now, leaving the mathematical details to Section 4.1, after receiving the sample data \(s\) that the student guessed all five cups correctly, our new credence in her special abilities, denoted \(P_{s}\), has more than reversed. We now think it roughly four times more probable that the student has the special abilities than that she is merely a random guesser: \(P_{s}(h') = 243/307 \approx 4/5\) and \(P_{s}(h) \approx 1/5\).
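The numbers in this example follow from Bayes’ theorem; a minimal sketch of the computation, carried out exactly with fractions:

```python
from fractions import Fraction

# The Bayesian update in the example, using the priors and likelihoods
# stated in the text.
prior_h  = Fraction(2, 3)     # null: the student guesses blindly
prior_h1 = Fraction(1, 3)     # alternative: special tea tasting ability

lik_h  = Fraction(1, 2) ** 5  # P_h(s): five correct guesses at 1/2 each
lik_h1 = Fraction(3, 4) ** 5  # P_h'(s): five correct guesses at 3/4 each

evidence = prior_h * lik_h + prior_h1 * lik_h1
post_h1 = prior_h1 * lik_h1 / evidence
print(post_h1)      # 243/307, roughly 4/5
print(1 - post_h1)  # 64/307, roughly 1/5
```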

We express our epistemic attitudes towards statistical hypotheses in terms of credences, and the data then impact on these credences in a regulated fashion. The process of adapting credences to data is termed Bayesian statistical inference.

It deserves emphasis that Bayesian statistics is not the sole user of an epistemic notion of probability. A frequentist understanding of probabilities assigned to statistical hypotheses seems nonsensical, but it is perfectly possible to interpret the probabilities of events, or the elements in sample space that represent the events, as epistemic, quite independently of the statistical method that is being used. As further explained in the next section, several philosophical developments of classical statistics employ epistemic probability, most notably fiducial probability (Fisher 1955 and 1956; see also Seidenfeld 1992 and Zabell 1992), likelihoodism (Hacking 1965, Edwards 1972, Royall 1997), and evidential probability (Kyburg 1961), or else connect the procedures of classical statistics to inference and support in some other way (e.g., Mayo 1996). In these developments of classical statistics, probabilities and functions over sample space are to some extent read epistemically, i.e., as expressions of the strength of evidence, the degree of support, or similar.

3. Classical statistics

The collection of procedures that may be grouped under classical statistics is vast and diverse. By and large, classical statistical procedures share the feature that they only rely on probability assignments over sample spaces. An important motivation for this is that those probabilities can be interpreted as frequencies, or as chances expressed in frequencies, from which the term of frequentist statistics originates. Classical statistical procedures are typically defined by some function over sample space, where this function depends, often exclusively, on the distributions that the hypotheses under consideration assign to the sample space. Over the domain of samples that may be obtained, an estimation function points to one element from a range of hypotheses, or perhaps to a set of them, as being in some sense the best fit with that sample. A test function, by contrast, will point to a candidate hypothesis that renders the sample too improbable and thus a candidate for rejection.

In sum, classical procedures employ the data to narrow down a set of hypotheses. Put in such general terms, it becomes apparent that classical procedures provide a response to the problem of induction. The data are used to get from a weak general statement about the target system to a stronger one, namely from a set of candidate hypotheses to a subset of them, possibly a singleton. A central concern in the philosophy of statistics is how we are to understand such procedures, and how we might justify them. The pattern of classical statistics resembles that of eliminative induction: in view of the data we discard some of the candidate hypotheses. Indeed classical statistics is often seen in loose association with Popper’s falsificationism, but this association is somewhat misleading. In classical procedures statistical hypotheses are discarded when they render the observed sample too improbable, which of course differs from discarding hypotheses that deem the observed sample impossible.

3.1 Basics of classical statistics

The foregoing already provided a short example and a rough sketch of classical statistical procedures. These are now specified in more detail, on the basis of Barnett (1999) as primary source. The following focuses on two very central procedures, hypothesis testing and estimation. The first has to do with the comparison of two statistical hypotheses, and invokes theory developed by Neyman and Pearson. The second concerns the choice of a hypothesis from a set, and employs procedures devised by Fisher.

3.1.1 Hypothesis testing

The procedure of Fisher’s null hypothesis test was already discussed briefly in the foregoing. Let \(h\) be the hypothesis of interest and, for the sake of simplicity, let \(S\) be a finite sample space. The hypothesis \(h\) imposes a distribution over the sample space, denoted \(P_{h}\). Every point \(s\) in the space represents a possible sample of data. We now define a function \(F\) on the sample space that identifies when we will reject the null hypothesis by marking the samples \(s\) that lead to rejection with \(F(s) = 1\), as follows:

\[ F(s) = \begin{cases}1 \quad \text{if } P_{h}(s) < q,\\ 0 \quad \text{otherwise.} \end{cases} \]

Notice that the rejection of \(h\) hinges on the probability of the data under the assumption of the hypothesis, \(P_{h}(s)\). This expression is often called the likelihood of the hypothesis for the sample \(s\). We can collect all samples \(s\) for which the likelihood drops below the threshold \(q\) in a so-called region of rejection, \(R_{q} = \{ s:\: F(s) = 1 \}\). The threshold \(q\) can be determined by requiring that the total probability of the region of rejection \(R_{q}\) is below a given level of error, \(P_{h}(R_{q}) < \alpha\). A common choice is \(\alpha = 0.05\).
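The test function and its region of rejection can be sketched for a hypothetical finite sample space. In the sketch below, samples are summarized by a binomial count over ten trials, and the threshold \(q\) is an illustrative choice:

```python
from math import comb

# Null hypothesis test on a finite sample space. Hypothetical setup:
# under h, the number of successes k in 10 trials is binomial(10, 1/2),
# and the sample is summarized by that count.
n = 10

def P_h(k):
    """Likelihood of h for the sample summarized by count k."""
    return comb(n, k) / 2 ** n

q = 0.01  # illustrative likelihood threshold

def F(k):
    """Test function: 1 marks the samples on which h is rejected."""
    return 1 if P_h(k) < q else 0

R_q = [k for k in range(n + 1) if F(k) == 1]
alpha = sum(P_h(k) for k in R_q)
print(R_q)    # only the extreme counts: [0, 1, 9, 10]
print(alpha)  # about 0.021, below the 0.05 level of error
```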

In standard null hypothesis testing little can be said about error rates if the null hypothesis is in fact false. In response to this Neyman and Pearson (1928, 1933, and 1967) devised the so-called likelihood ratio test, a test that compares the likelihoods of two rivaling hypotheses. Let \(h\) and \(h'\) be the null and the alternative hypothesis respectively. We can compare these hypotheses by the following test function \(F\) over the sample space:

\[ F(s) = \begin{cases}1 \quad \text{if } \frac{P_{h'}(s)}{P_{h}(s)} > r,\\ 0 \quad \text{otherwise,} \end{cases} \]

where \(P_{h}\) and \(P_{h'}\) are the probability distributions over the sample space determined by the statistical hypotheses \(h\) and \(h'\) respectively. If \(F(s) = 1\) we decide to reject the null hypothesis \(h\); otherwise we accept \(h\) for the time being and so disregard \(h'\). A region of rejection \(R_{r}\) can be defined analogously to \(R_{q}\) above.

The decision to accept or reject a hypothesis is associated with the possibility of error. Neyman especially has become known for interpreting this in a strictly behaviorist fashion. For further discussion on this point, see Section 3.2.2. We commit a so-called type-I error if we reject the null hypothesis when in fact it is true. So for a given test function, the probability of a type-I error, often denoted with \(\alpha\), is the probability, according to the null hypothesis \(h\), of obtaining data within the region of rejection, i.e., data that lead us to falsely reject this hypothesis \(h\):

\[ \alpha = P_{h}(R_{r}) = \sum_{s \in S} F(s) P_{h}(s) . \]

The probability \(\alpha\) is alternatively called the significance level of the test. We can also make the converse type-II error of accepting the null hypothesis when in fact the alternative is true. The probability of a type-II error, often denoted \(\beta\), is the probability, according to the alternative hypothesis \(h'\), of obtaining data outside of the region of rejection, i.e., data that lead us to falsely accept the null hypothesis \(h\):

\[\beta = 1 - P_{h'}(R_{r}) = \sum_{s \in S} \bigl(1 - F(s)\bigr) P_{h'}(s) .\]

The probability \(1 - \beta\) is alternatively called the power of the test. An optimal test is one that minimizes both the errors \(\alpha\) and \(\beta\), and hence minimizes the significance level while maximizing power.

In their fundamental lemma, Neyman and Pearson proved that the decision has optimal significance and power for, and only for, likelihood-ratio test functions \(F\). That is, an optimal test depends only on a threshold for the ratio \(P_{h'}(s) / P_{h}(s)\). The example of the tea tasting student allows for an easy illustration of the likelihood ratio test.

Neyman-Pearson test
Next to the null hypothesis \(h\) that the student is randomly guessing, we now consider the alternative hypothesis \(h'\) that she has a chance of \(3/4\) to guess the order of tea and milk correctly. The samples \(s\) are binary 5-tuples that record guesses as correct and incorrect. To determine the likelihoods of the two hypotheses, and thereby the value of the test function for each sample, we only need to know the so-called sufficient statistic, in this case the number of correct guesses \(k\) out of the total number \(n\), independently of the order. We can collect all samples \(s\) in which \(k\) out of \(n\) guesses are correct in a set of sequences, denoted by \(S_{k/n}\). All sequences within these sets are equally probable according to the hypotheses under scrutiny. In the example we observe \(n = 5\) guesses, so we have \(P_{h}(S_{k/5}) = \binom{5}{k} 1/2^{5}\) and \(P_{h'}(S_{k/5}) = \binom{5}{k} 3^{k} / 4^{5}\). For any individual sample \(s\) and for the aforementioned sets of sequences \(S_{k/n}\), the likelihood ratio becomes \(3^{k} / 2^{5}\). If we require that the significance level is lower than 5%, then it can be calculated that only the set of samples with \(k = 5\) may be included in the region of rejection. Accordingly we may set the cut-off point \(r\) at \(r = 3^{4} / 2^{5}\). Upon finding that the student guesses all five cups correctly, we reject the null hypothesis with 5% significance.
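The numbers in this example can be checked with a short script. The following is a minimal sketch (the helper names and the greedy construction are mine, not from the text): it ranks the values of the sufficient statistic by likelihood ratio and admits them into the region of rejection as long as the type-I error stays below 5%, using exact rational arithmetic.

```python
from fractions import Fraction
from math import comb

n = 5
p0, p1 = Fraction(1, 2), Fraction(3, 4)  # null vs. alternative accuracy

# Likelihood of the set S_{k/5} (k correct guesses) under each hypothesis
P_h  = {k: comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)}
P_h1 = {k: comb(n, k) * p1**k * (1 - p1)**(n - k) for k in range(n + 1)}

# Admit outcomes in order of descending likelihood ratio, keeping the
# total probability of false rejection below 5%
alpha = Fraction(5, 100)
region, achieved = [], Fraction(0)
for k in sorted(range(n + 1), key=lambda k: P_h1[k] / P_h[k], reverse=True):
    if achieved + P_h[k] < alpha:
        region.append(k)
        achieved += P_h[k]

power = sum(P_h1[k] for k in region)
print(region)    # [5]: only five correct guesses warrant rejection
print(achieved)  # 1/32, the achieved significance level
print(power)     # 243/1024, the power of this test
```

As the text states, only the outcome \(k = 5\) fits: adding the outcomes with four correct guesses would push the significance level to \(6/32\), well above 5%.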

Notice that the construction of a test function relies on the adoption of a specific structuring of the sample space. Samples that are equally probable according to the hypotheses under scrutiny are collected into sets that correspond to observations of a sufficient statistic, and the region of rejection is then defined in terms of this statistic. In some cases, e.g., so-called one-sided tests, the test function relies on a further structuring of the sample space.

Another expression of the error of a test, closely related to but different from the significance level, is the so-called p-value (Schervish 1996). After determining a test function and region of rejection by requiring a certain significance level \(\alpha\), as in the foregoing, we might obtain data well inside this region, so that it is also included in a region of rejection that is much smaller. The p-value of a given sample and test is the lowest possible significance level of the test that warrants the rejection of the null hypothesis with that sample.

P-values
Imagine that we offer the student fifteen cups of tea. Using a significance level of 2%, and entertaining the same hypotheses as in the foregoing, we can determine a new region of rejection \(R_{r}\) in this larger sample space. It includes the sequence of fifteen correct guesses, the singleton \(S_{15/15}\), but also the set of all 15 sequences with one failure, \(S_{14/15}\), the set of all 105 sequences with two failures, \(S_{13/15}\), and the set of all 455 sequences with three failures, \(S_{12/15}\). Every specific sequence has the same probability according to the null hypothesis, namely \(P_{h}(S_{15/15}) = 1/2^{15}\), so their summed probability is \(P_{h}(R_{r}) = 576 / 2^{15} \approx 0.0176 = 1.76 \%\). Now imagine that we find the student guesses all but one cup correctly. We can then reject the null hypothesis at 2% significance level, because the sample falls within \(R_{r}\). But in retrospect we could have set our significance level much lower, by including only the sample \(S_{15/15}\) and the 15 samples collected in \(S_{14/15}\) in the smaller region of rejection \(R_{r'}\), with a summed probability as low as \(P_{h}(R_{r'}) = 16/2^{15} \approx 0.05 \%\). To express how convincingly we rejected the null hypothesis, we report the latter number as the p-value.
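The arithmetic in this example is easy to reproduce. The sketch below (variable names are mine) builds the 2% region of rejection by adding the most extreme counts of correct guesses, and then computes the p-value of fourteen correct guesses as the null probability of a result at least that extreme.

```python
from fractions import Fraction
from math import comb

n = 15
# Probability of exactly k correct guesses under the null (pure guessing)
P_h = {k: Fraction(comb(n, k), 2**n) for k in range(n + 1)}

# Region of rejection at the 2% level: include the most extreme counts
# while the total rejection probability stays below 0.02
region, total = [], Fraction(0)
for k in range(n, -1, -1):
    if total + P_h[k] < Fraction(2, 100):
        region.append(k)
        total += P_h[k]
    else:
        break

print(region)  # [15, 14, 13, 12]
print(total)   # 9/512 = 576/32768, roughly 1.76%

# p-value of fourteen correct guesses: the lowest significance level
# that still warrants rejection with this sample
p_value = P_h[15] + P_h[14]
print(p_value)  # 1/2048 = 16/32768, roughly 0.05%
```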

The p-value is unfortunately ill-understood by statistical practitioners. We briefly consider the debate over it below.

3.1.2 Estimation

In this section we consider parameter estimation by maximum likelihood, as first devised by Fisher (1956). Once again we employ a finite sample space \(S\). In the tea tasting example, the samples \(s\) are finite sequences of guesses. A set of such sequences is denoted by an augmented capital \(S\) as above, e.g., we write the set of sequences starting with five correct guesses as \(S_{11111}\), and we write the set of all sequences with \(k\) successes in \(n\) trials as \(S_{k/n}\).

Maximum likelihood estimation, or MLE for short, is a tool for determining the best fitting among a set of hypotheses, often called a statistical model. Let \(M = \{h_{\theta} :\: \theta \in \Theta \}\) be the model, labeled by the parameter \(\theta\), and \(P_{\theta}\) the distribution associated with \(h_{\theta}\). Then define the maximum likelihood estimator \(\hat{\theta}\) as a function over the sample space:

\[\hat{\theta}(s) = \{ \theta :\, \forall \theta' \bigl( P_{\theta'}(s) \leq P_{\theta}(s) \bigr) \}.\]

So the estimator is a set, typically a singleton set, of values of \(\theta\) for which the likelihood of \(h_{\theta}\) on the data \(s\) is maximal. The associated best hypothesis we denote with \(h_{\hat{\theta}}\). This can be illustrated easily for the tea tasting student.

Maximum likelihood estimation (MLE)
A natural statistical model for the case of the tea tasting student consists of hypotheses \(h_{\theta}\) for all possible probabilities for correctly guessing, or levels of accuracy, that the student may have, \(\theta \in [0, 1]\). The number of correct guesses \(k\) and the total number of guesses \(n\) are the sufficient statistics: the probability of a sample only depends on those numbers. For the set of finite sequences \(S_{k/n}\) the associated likelihood of \(h_{\theta}\) is

\[ P_{\theta}(S_{k/n}) = \binom{n}{k} \theta^{k} (1 - \theta)^{n - k} . \]

The maximum likelihood estimator for a given observation \(S_{k/n}\) then is \(\hat{\theta} = k / n\), because among all hypotheses \(h_{\theta}\) the likelihood \(P_{\theta}(S_{k/n})\) for \(S_{k/n}\) is maximal at that value.
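That \(\hat{\theta} = k/n\) maximizes the likelihood can be checked numerically. The sketch below (a grid search of my own devising, not a method from the text) evaluates the binomial likelihood over a fine grid of candidate accuracies and picks the maximum.

```python
from math import comb

def likelihood(theta, k, n):
    # Likelihood of h_theta for the observation S_{k/n}
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Grid search over candidate accuracies theta for two observations
grid = [i / 1000 for i in range(1001)]
mle_5_of_5 = max(grid, key=lambda t: likelihood(t, 5, 5))
mle_4_of_5 = max(grid, key=lambda t: likelihood(t, 4, 5))

print(mle_5_of_5)  # 1.0, i.e., k/n for k = n = 5
print(mle_4_of_5)  # 0.8, i.e., k/n for k = 4, n = 5
```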

We suppose that the number of cups served to the student is fixed at \(n\), so that the sample space is finite. Notice, finally, that \(h_{\hat{\theta}}\) is the hypothesis that makes the data most probable, and not the hypothesis that is most probable in the light of the data.

There are all manner of requirements that we might impose on an estimator function. One is that the estimator must be consistent, a concept defined by reference to a true value \(\theta^{\star}\). An estimator function is consistent if and only if for ever larger samples \(S_{n}\) the estimator function \(\hat{\theta}\) converges to the true parameter value \(\theta^{\star}\). We call a hypothesis \(h_{\theta^{\star}}\) true if and only if the data are distributed according to the associated distribution \(P_{\theta^{\star}}\) or, equivalently, if the associated distribution \(P_{\theta^{\star}}\) is the one from which the data are generated. Another requirement is that the estimator must be unbiased, meaning that for finite data there is no discrepancy between the true parameter value and the expected value of the estimator, where this expected value is computed on the basis of \(P_{\theta^{\star}}\). The classical statistical literature on estimation discusses several other such good-making features of estimator functions.
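Both properties can be illustrated for the binomial MLE \(\hat{\theta} = k/n\). The sketch below (the helper names and the hypothetical true value \(\theta^{\star} = 0.75\) are mine) computes the expectation of the estimator under \(P_{\theta^{\star}}\), which equals \(\theta^{\star}\) (unbiasedness), and shows that the probability of the estimate missing \(\theta^{\star}\) by more than a fixed margin shrinks as the sample grows (the behavior that consistency captures in the limit).

```python
from math import comb

theta_star = 0.75  # a hypothetical true accuracy

def pmf(n, k):
    # Probability of k correct guesses out of n under P_{theta_star}
    return comb(n, k) * theta_star**k * (1 - theta_star)**(n - k)

# Unbiasedness: the expectation of the estimator k/n equals theta_star
def expected_mle(n):
    return sum((k / n) * pmf(n, k) for k in range(n + 1))

print(expected_mle(10))  # 0.75 (up to rounding)

# Consistency-like behavior: the probability that the estimate misses
# theta_star by more than 0.05 shrinks as the sample grows
def miss_prob(n, delta=0.05):
    return sum(pmf(n, k) for k in range(n + 1)
               if abs(k / n - theta_star) > delta)

print(miss_prob(20) > miss_prob(500))  # True
```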

MLE is not the only procedure for determining the value of a parameter of interest on the basis of statistical data. We may also maximize or minimize some other target function. In the context of curve fitting, for instance, we can minimize the sum of the squares of the distances between the prediction of the statistical hypothesis and the given data points, known as the method of least squares. However, under the assumption of a statistical model in which errors are normally distributed, this procedure comes down to MLE. A more general perspective, first developed by Wald (1950), is provided by measuring the discrepancy between the predictions of the hypothesis and the actual data in terms of a loss function. The summed squares and the likelihoods may be taken as expressions of this loss. The principle that expected loss minimization is at the core of statistics has been developed in numerous directions since. Statistical learning theory, which is briefly covered below, offers a systematic analysis of this approach to statistics that can arguably underpin the estimation procedures of classical statistics (cf. Hastie et al. 2009, James et al. 2014).
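The equivalence of least squares and MLE under normally distributed errors can be seen in a toy computation. The sketch below (toy data and helper names are mine) fits a no-intercept line \(y = \theta x\): the closed-form least-squares solution coincides with the value found by maximizing the Gaussian log-likelihood, up to the resolution of the search grid.

```python
import math

# Toy data for a no-intercept line y = theta * x with noisy observations
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]

# Least squares: closed form theta = sum(x*y) / sum(x*x)
theta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Gaussian MLE: maximize the log-likelihood under normally distributed
# errors (unit variance) by grid search
def loglik(theta):
    return sum(-0.5 * (y - theta * x) ** 2 - 0.5 * math.log(2 * math.pi)
               for x, y in zip(xs, ys))

grid = [i / 10000 for i in range(20000)]
theta_mle = max(grid, key=loglik)

print(theta_ls, theta_mle)  # the two estimates agree (up to grid resolution)
```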

3.1.3 Confidence intervals

Often the estimation is coupled to a so-called confidence interval. To explain this notion, we first construct a closely related notion, the region of probable estimation. For ease of exposition, assume that \(\Theta\) consists of the real numbers and that every point \(s\) in the sample space is labelled with a single value of the estimation function \(\hat{\theta}(s)\). We define the set \(R_{\tau} = \{ s:\: \hat{\theta}(s) = \tau \}\), the set of samples for which the estimator function has the value \(\tau\). We can now collate a region in sample space within which the estimator function \(\hat{\theta}\) is not too far off the mark, specifically, differing by at most \(\Delta\) on either side of the true parameter value \(\theta^{\star}\):

\[ R_{\Delta} = \{ R_{\tau} :\: \tau \in [ \theta^{\star} - \Delta , \theta^{\star} + \Delta ] \} . \]

This so-called region of probable estimation is the union of all sets \(R_{\tau}\) for which \(\tau \in [ \theta^{\star} - \Delta , \theta^{\star} + \Delta ]\).

We may choose the value of \(\Delta\) in such a way that the region of probable estimation \(R_{\Delta}\) covers a large portion of the sample space, as measured by the true distribution \(P_{\theta^{\star}}\):

\[ P_{\theta^{\star}}(R_{\Delta}) = \int_{\theta^{\star} - \Delta}^{\theta^{\star} + \Delta} P_{\theta^{\star}}(R_{\tau}) d\tau = 1 - \alpha ,\]

where \(\alpha\) is a margin of error. If we fix a specific margin of error \(\alpha\), this determines the size of \(\Delta\), which is thereby set to \(\Delta_{1 - \alpha}\). The size of the latter says something about the quality of the estimate: if \(\Delta_{1-\alpha}\) is small, the estimate \(\hat{\theta}\) is accurate. Notice that such regions of probable estimation carry a clear frequentist interpretation: when repeating the data collection under the assumption of the true distribution, the fraction of times in which the estimator \(\hat{\theta}\) is further away from \(\theta^{\star}\) than \(\Delta_{1-\alpha}\) tends to \(\alpha\).

We can now define the confidence interval around an estimate. The key move is that we take the estimate \(\hat{\theta}\) as a stand-in for the true parameter value \(\theta^{\star}\). Starting from a specific estimation \(\hat{\theta}\), we can calculate a region of probable estimation \(R_{\Delta}\) based on an error rate \(\alpha\) by putting the estimate in the role of the true distribution, i.e., by substituting \(\theta^{\star} = \hat{\theta}\) in the foregoing formulas and calculating \(\Delta_{1-\alpha}\) under this assumption. The symmetric confidence interval is then defined as:

\[ C_{1 - \alpha} = [ \hat{\theta} - \Delta_{1 - \alpha} , \hat{\theta} + \Delta_{1 - \alpha} ] . \]

Statistical folklore typically sets \(\alpha\) at a value of 5%, leading to the symmetric 95% confidence interval \(C_{95\%}\). The confidence interval is often reported alongside an estimate to indicate the estimate’s accuracy, and thereby communicate something about the size of the effect of an intervention preceding the data collection (cf. Cumming 2012).

We have to be very careful when interpreting confidence intervals. The suggestion is that it can carry a frequentist interpretation much like the region of probable estimation, as in: we find the true value \(\theta^{\star}\) within \(\Delta_{1 - \alpha}\) of our estimate \(\hat{\theta}\) in a fraction of \(1 - \alpha\) of all samples. But this is not right. For one, to compute \(\Delta_{1 - \alpha}\) we substituted the estimate for the true value. The actual value of \(\Delta_{1 - \alpha}\) may be different, depending on the location of the true parameter value and the value for \(\Delta_{1 - \alpha}\) at that point. However, even if \(\Delta_{1 - \alpha}\) is the same when calculated by means of the estimate and by means of the true parameter, we still cannot understand the confidence interval as the interval within which the true parameter value is included with a frequency of \(1 - \alpha\). The true parameter value stays fixed, and therefore it does or does not fall within the given interval, period. It makes no sense to say that it does this with a certain frequency.

We can still make sense of confidence intervals in terms of frequencies. To do so, we have to assume that the value of \(\Delta_{1 - \alpha}\) is the same for all possible values of the true parameter. In that case we can say, of the specific confidence interval calculated from the currently obtained estimate \(\hat{\theta}\), that it covers all possible values of the true parameter value \(\theta^{\star}\) for which, if they were the true value, the currently obtained estimate \(\hat{\theta}\) would be included in the region of probable estimation, while for true parameter values outside of the specific confidence interval, the currently obtained estimate would not be. Accordingly, on repeating the data collection and repeatedly constructing a confidence interval around the newly obtained estimate \(\hat{\theta}\), maintaining the assumption that \(\Delta_{1-\alpha}\) is constant from one confidence interval to the next, the true parameter value is included in a fraction \(1 - \alpha\) of this sequence of confidence intervals. That is, the frequentist guarantee pertains to the procedure of constructing confidence intervals, not to the specific confidence interval around any given estimate.
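This procedural reading can be illustrated by simulation. The sketch below (my own construction; it uses the standard normal-approximation interval for a binomial proportion, which is an assumption and not a formula from the text) repeatedly draws a sample, builds a 95% interval around the estimate, and counts how often the interval contains the fixed true value. The long-run coverage of the procedure is close to 95%, even though each individual interval either contains the true value or does not.

```python
import math
import random

random.seed(1)
theta_star, n, trials = 0.6, 100, 2000
z = 1.96  # standard normal quantile for a 95% interval

covered = 0
for _ in range(trials):
    k = sum(random.random() < theta_star for _ in range(n))
    est = k / n
    # Normal-approximation interval around the estimate (plug-in variance)
    delta = z * math.sqrt(est * (1 - est) / n)
    if est - delta <= theta_star <= est + delta:
        covered += 1

print(covered / trials)  # close to 0.95
```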

There are of course many more procedures for estimating a variety of statistical targets, and there are many more expressions for the quality of the estimation (e.g., bootstrapping, see Efron and Tibshirani 1993). Theories of estimation often come equipped with a rich catalogue of situation-specific criteria for estimators, reflecting the epistemic and pragmatic goals that the estimator helps to achieve. However, in themselves the estimator functions do not present guidelines for beliefs about the accuracy of any given estimate and, importantly, confidence intervals do not either.

3.2 Problems for classical statistics

Classical statistics is widely discussed in the philosophy of statistics. In what follows two problems with the classical approach are outlined, to wit, its problematic interface with belief and the fact that it violates the so-called likelihood principle. Many more specific problems can be seen to derive from these general ones.

3.2.1 Interface with belief

Consider the likelihood ratio test of Neyman and Pearson. As indicated, the significance level of a test is an error rate that will manifest if data collection and testing is repeated, assuming that the null hypothesis is in fact true. Notably, this value does not tell us how probable the truth of the null hypothesis is, and neither does the p-value calculated from a specific sample. However, many scientists do use hypothesis testing in this manner, and there is much debate over what can and cannot be derived from a p-value (cf. Berger and Sellke 1987, Casella and Berger 1987, Cohen 1994, Harlow et al. 1997, Wagenmakers 2007, Ziliak and McCloskey 2008, Spanos 2007, Greco 2011, Sprenger 2016). After all, the test leads to the advice to either reject the hypothesis or accept it, and this seems conceptually very close to giving a verdict of truth or falsity. An attempt to provide it with an interpretation in terms of support has been undertaken by Cox and Mayo (2006).

While the evidential value of p-values is much debated, many admit that the probability of data according to a hypothesis cannot be used straightforwardly as an indication of how believable the hypothesis is (cf. Gillies 1971, Spielman 1974 and 1978). Such usage runs into the so-called base-rate fallacy. The example of the tea tasting student is again instructive.

Base-rate fallacy
Imagine that we travel the country to perform the tea tasting test with a large number of students, and that we find a particular student who guesses all five cups correctly. Should we conclude that the student has a special talent for tasting tea? The problem is that this depends on how many students among those tested actually have the special talent. If the ability is very rare, it is more attractive to put the five correct guesses down to a chance occurrence. By comparison, imagine that all the students enter a lottery. In analogy to a student guessing all cups correctly, consider a student who wins one of the lottery’s prizes. In a normal lottery, winning a prize is very improbable, unless one is in cahoots with the bookmaker, which is the analogue of having a special tea tasting ability. But surely if a student wins the lottery, this is by itself not a good reason to conclude that they must have committed fraud and call for their arrest. Similarly, if a student has guessed all cups correctly, we cannot simply conclude that they have special abilities.

Essentially the same problem occurs if we consider the estimations of a parameter as direct advice on what to believe, as made clear by an example of Good (1983, p. 57) that is presented here in the tea tasting context. After observing five correct guesses, we have \(\hat{\theta} = 1\) as maximum likelihood estimator. But it is hardly believable that the student will in the long run be 100% accurate. The point that estimation and belief maintain complicated relations is also put forward in discussions of Lindley’s paradox (Lindley 1957, Spanos 2013, Sprenger 2013), and in the faulty interpretation of confidence intervals as regions of probable estimation. In short, it is wrongheaded to turn the results of classical statistical procedures into beliefs simpliciter.

It is a matter of debate whether any of this can be blamed on classical statistics. Initially, Neyman was emphatic that his testing procedures could not be taken as inferences, or as in some other way pertaining to the epistemic status of the hypotheses. His own statistical philosophy was strictly behaviorist (cf. Neyman 1957), so it may be argued that the problems disappear if only scientists abandon their faulty epistemic use of classical statistics. As explained in the foregoing, we can uncontroversially associate error rates with classical procedures, and so with the decisions that flow from these procedures. A behavioral and error-based understanding of classical statistics seems just fine. On the other hand, as further elaborated below, both statisticians and philosophers have argued that an epistemic reading of classical statistics is possible, and in fact preferable (e.g., Fisher 1955, Royall 1997). Accordingly, many have attempted to reinterpret or develop the theory, in order to align it with the epistemically oriented statistical practice of scientists (see Mayo 1996, Mayo and Cox 2006, Mayo and Spanos 2011, Spanos 2013b).

3.2.2 The nature of evidence

Hypothesis tests and estimations are sometimes criticised because their results generally depend on the probability functions over the entire sample space, and not exclusively on the probabilities of the observed sample. That is, the decision to accept or reject the null hypothesis depends not just on the probability of what has actually been observed according to the various hypotheses, but also on the probability assignments over events that could have been observed but were not. A well-known illustration of this problem concerns so-called optional stopping (Robbins 1952, Roberts 1967, Kadane et al. 1996, Mayo 1996, Howson and Urbach 2006).

Optional stopping is here illustrated for the likelihood ratio test of Neyman and Pearson, but a similar story can be run for Fisher’s null hypothesis test and for the determination of estimators and confidence intervals.

Optional stopping
Imagine two researchers who are both testing the same student on her ability to determine the order in which milk and tea were poured in her cup. They both entertain the null hypothesis that she is guessing at random, with a probability of \(1/2\), against the alternative of her guessing correctly with a probability of \(3/4\). The more diligent researcher of the two decides to record six trials. The more impatient researcher, on the other hand, records at most six trials, but decides to stop recording at the first trial that the student guesses incorrectly. Now imagine that, in actual fact, the student guesses all but the last of the cups correctly. Both researchers then have the exact same data of five successes and one failure, and the likelihoods for these data are the same for the two researchers too. However, while the diligent researcher cannot reject the null hypothesis, the impatient researcher can.

This might strike us as peculiar: statistics should tell us the objective impact that the data have on a hypothesis, but here the impact seems to depend on the sampling plan of the researcher and not just on the data themselves. As further explained in Section 3.2.3, the results of the two researchers differ because of differences in how samples that were not observed are factored into the procedure.

Some will find this dependence unacceptable: the intentions and plans of the researcher are irrelevant to the evidential value of the data. But others argue that it is just right. They maintain that the impact of data on the hypotheses should depend on the stopping rule or protocol that is followed in obtaining it, and not only on the likelihoods that the hypotheses have for those data (e.g. Mayo 1996). The motivating intuition is that upholding the irrelevance of the stopping rule opens up the possibility for opportunistic choices in data collection. In fact, defenders of classical statistics turn the tables on those who maintain that optional stopping is irrelevant. They submit that it allows us to reason to a foregone conclusion by, for example, persistent experimentation: as a likelihoodist or Bayesian we might decide to cease experimentation only if the preferred result is reached. However, as shown in Kadane et al. (1996a and 1996b) and further discussed in Steele (2012), persistent experimentation is not guaranteed to yield any desired outcome, as long as we make sure to align the procedures with the appropriate evidence conception, e.g., likelihoodist or Bayesian.

The debate over optional stopping is eventually concerned with the appropriate evidential impact of data. A central concern in this wider debate is the so-called likelihood principle (see Hacking 1965 and Edwards 1972). This principle has it that the likelihoods of hypotheses for the observed data completely fix the evidential impact of those data on the hypotheses. In the formulation of Berger and Wolpert (1984), the likelihood principle states that two samples \(s\) and \(s'\) are evidentially equivalent exactly when \(P_{i}(s) = kP_{i}(s')\) for all hypotheses \(h_{i}\) under consideration, given some constant \(k\). Famously, Birnbaum (1962) offers a proof of the principle from more basic assumptions. This proof relies on the assumption of conditionality. Say that we first toss a coin to determine what experiment to run, find that it lands heads, then do the experiment associated with this outcome, to record the sample \(s\). Compare this to the case where we do the experiment, without randomly picking it first, and find \(s\) directly. The conditionality principle states that this second sample has the same evidential impact as the first one: what we could have found if we had randomly chosen to run another experiment, but did not find, has no impact on the evidential value of the sample that we did find. Mayo (2010, 2014) has taken issue with Birnbaum’s derivation of the likelihood principle, but new defenses have been offered since (Dawid 2014, Gandenberger 2014).
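The Berger–Wolpert condition can be illustrated with a standard example (my choice, not from the text): five successes and one failure, obtained either in a fixed number of six trials or by sampling until the first failure. The two likelihood functions differ only by a constant factor, so by the likelihood principle the two samples are evidentially equivalent.

```python
from math import comb

# Five successes and one failure under two sampling plans:
# (a) fixed number of six trials: binomial likelihood for the count
# (b) sample until the first failure: likelihood of the sequence itself
def binom_lik(theta):
    return comb(6, 5) * theta**5 * (1 - theta)

def seq_lik(theta):
    return theta**5 * (1 - theta)

# The ratio is the constant k = 6 for every theta, so the samples are
# evidentially equivalent in the sense of Berger and Wolpert (1984)
ratios = [binom_lik(t) / seq_lik(t) for t in (0.3, 0.5, 0.75, 0.9)]
print(ratios)  # each ratio equals 6, the constant k
```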

The classical view sketched above entails a violation of the likelihood principle: the impact of the observed data may be different depending on the probability of other samples than the observed one, because those other samples come into play when determining regions of acceptance and rejection. The Bayesian procedures discussed in Section 4, on the other hand, uphold the likelihood principle: in determining the posterior distribution over hypotheses only the prior and the likelihood of the observed data matter. In the debate over optional stopping and in many of the other debates between classical and Bayesian statistics, the likelihood principle is the focal point.

3.2.3 Excursion: optional stopping

The view that the data reveal more, or something else, than what is expressed by the likelihoods of the hypotheses at issue merits detailed attention. Here we investigate this issue further with reference to the controversy over optional stopping.

Let us consider the analyses of the two researchers above in some numerical detail by constructing the regions of rejection for both of them.

Determining regions of rejection
The diligent researcher considers all 6-tuples of success and failure as the sample space, and takes their numbers as sufficient statistic. The event of six successes, or six correct guesses, has a probability of \(1 / 2^{6} = 1/64\) under the null hypothesis that the student is merely guessing, against a probability of \(3^{6} / 4^{6}\) under the alternative hypothesis. If we set \(r < 3^{6} / 2^{6}\), then only this sample of six successes is included in the region of rejection of the null hypothesis. Samples with five successes have a probability of \(1/64\) under the null hypothesis too, against a probability of \(3^5 / 4^{6}\) under the alternative. So by lowering the threshold ratio \(r\) by a factor 3, we also include the six samples with five successes and one failure in the region of rejection. But this will lead to a total probability of false rejection of \(7/64\), i.e., a total of seven samples with a probability of \(1/64\) each, which is larger than 5%. So these additional six samples cannot be included in the region of rejection, to prevent the test from surpassing a 5% significance level. Hence the diligent researcher does not reject the null hypothesis upon finding five successes and one failure.

For the impatient researcher, on the other hand, the sample space is much smaller. Apart from the sample consisting of six successes, all samples consist of a series of successes ending with a failure, differing only in the length of the series. Yet the probabilities over the two samples of length six are the same as for the diligent researcher. As before, the sample of six successes is again included in the region of rejection. Similarly, the sequence of five successes followed by one failure also has a probability of \(1/64\) under the null hypothesis, against a probability of \(3^5 / 4^{6}\) according to the alternative. The difference is that lowering the likelihood ratio to include this sample in the region of rejection leads to the inclusion of only this one sample, the series of five successes ending with a failure. And if we include this single extra sample in the region of rejection, the probability of false rejection under the null hypothesis becomes \(1/32\) and hence does not exceed 5%. Consequently, on the basis of these data the impatient researcher can reject the null hypothesis that the student is merely guessing.
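The contrast between the two researchers can be reproduced computationally. The sketch below (helper names are mine) enumerates each researcher's sample space, groups samples with equal likelihood ratio, and admits whole groups into the region of rejection by descending ratio, stopping as soon as a group would push the type-I error to 5% or beyond.

```python
from fractions import Fraction
from itertools import product

p0, p1 = Fraction(1, 2), Fraction(3, 4)
alpha = Fraction(5, 100)

def prob(seq, p):
    # Probability of a specific sequence of guesses (1 = correct)
    out = Fraction(1)
    for g in seq:
        out *= p if g else 1 - p
    return out

def rejection_region(samples):
    # Group samples with equal likelihood ratio, then admit whole groups
    # by descending ratio, stopping when a group no longer fits below alpha
    # (so the region stays a likelihood-ratio threshold region)
    groups = {}
    for s in samples:
        groups.setdefault(prob(s, p1) / prob(s, p0), []).append(s)
    region, achieved = set(), Fraction(0)
    for r in sorted(groups, reverse=True):
        p_group = sum(prob(s, p0) for s in groups[r])
        if achieved + p_group < alpha:
            region.update(groups[r])
            achieved += p_group
        else:
            break
    return region, achieved

# Diligent researcher: all 2^6 sequences of six recorded trials
diligent = list(product((0, 1), repeat=6))
# Impatient researcher: runs of successes ending in the first failure,
# plus the sequence of six successes
impatient = [tuple([1] * i + [0]) for i in range(6)] + [(1,) * 6]

observed = (1, 1, 1, 1, 1, 0)  # five successes, then a failure
region_d, _ = rejection_region(diligent)
region_i, _ = rejection_region(impatient)
print(observed in region_d)  # False: the diligent researcher cannot reject
print(observed in region_i)  # True: the impatient researcher can
```

The run confirms the text: the six equiprobable five-success samples block the diligent researcher, while for the impatient researcher the observed sequence is the only five-success sample there is.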

It is instructive to consider why exactly the impatient researcher can reject the null hypothesis. In virtue of the sampling plan, the other samples with five successes, namely the ones which kept the diligent researcher from including the observed sample in the region of rejection on pain of exceeding the error probability, could not have been observed. This exemplifies that the results of a classical statistical procedure do not only depend on the likelihoods for the actual data, which are indeed the same for both researchers. They also depend on the likelihoods for data that we did not obtain.

In the above example, it may be considered confusing that the protocol used for optional stopping depends on the data that are being recorded. But the controversy over optional stopping also emerges if there is no such interdependence between stopping rule and data. For example, imagine a third researcher who samples until the diligent researcher is done, or aborts sampling before that if she starts to feel peckish. Furthermore we may suppose that with each new cup offered to the student, the probability of feeling peckish is \(\frac{1}{2}\). It turns out that this peckish researcher will also be able to reject the null hypothesis if she completes a series of five successes and one failure. It certainly seems at variance with the objectivity of the statistical procedure that this rejection depends on the physiology and the state of mind of the researcher: if she had not kept open the possibility of a snack break, she would not have rejected the null hypothesis, even though she did not actually take that break. As Jeffreys (1961, p. 385) famously quipped, this is indeed a “remarkable procedure”.

Yet the case is not as clear-cut as it may seem. For one, the peckish researcher is arguably testing two hypotheses in tandem, one about the ability of the tea tasting student and another about her own peckishness. Together the combined hypotheses have a different likelihood for the actual sample than the simple hypothesis considered by the diligent researcher. The likelihood principle given above dictates that this difference does not affect the evidential impact of the actual sample, but some retain the intuition that it should. Moreover, in some cases this intuition is shared by those who uphold the likelihood principle, namely when the stopping rule depends on the process being recorded in a way not already expressed by the hypotheses at issue (cf. Robbins 1952, Howson and Urbach 2006, p. 365). In terms of our example, if the student is merely guessing and hence gets it right only by chance, then it may be more probable that the researcher gets peckish out of sheer boredom than if the student performs far below or above chance level. In such a case the act of stopping itself reveals something about the hypotheses at issue, and this should be reflected in the likelihoods of the hypotheses. This makes the evidential impact that the data have on the hypothesis dependent on the stopping rule after all. And so the controversy over the relevance of the stopping rule continues (cf. Steel 2003, Fletcher 2023).

3.3 Responses to criticism

There have been numerous responses to the above criticisms. Some of those responses effectively reinterpret the classical statistical procedures as pertaining only to the evidential impact of data. Other responses develop the classical statistical theory to accommodate the problems. Their common core is that they establish or at least clarify the connection between two conceptual realms: the statistical procedures refer to physical probabilities, while their results pertain to evidence and support, or else to the rejection or acceptance of hypotheses.

3.3.1 Likelihoodism

Classical statistics is often presented as providing us with advice for actions. The error probabilities do not tell us what epistemic attitude to take on the basis of statistical procedures; rather they indicate the long-run frequency of error if we live by them. Specifically Neyman advocated this interpretation of classical procedures. Against this, Fisher (1935a, 1955), Pearson, and other classical statisticians have argued for more epistemic interpretations, and many more recent authors have followed suit. In applications of statistics in the sciences, the Bayes factor has become an increasingly important measure of evidential strength (cf. Morey et al. 2016).

Central to the above discussion on classical statistics is the concept of likelihood, which reflects how the data bears on the hypotheses at issue. In the works of Hacking (1965), Edwards (1972), and more recently Royall (1997), the likelihoods are taken as a cornerstone for statistical procedures and given an epistemic interpretation. They are said to express the strength of the evidence presented by the data, or the comparative degree of support that the data give to a hypothesis. Hacking formulates this idea in the so-called law of likelihood (1965, p. 59): if the sample \(s\) is more probable on the condition of \(h_{0}\) than on \(h_{1}\), then \(s\) supports \(h_{0}\) more than it supports \(h_{1}\).
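The law of likelihood amounts to a simple comparison of likelihoods. The following sketch illustrates it for two Bernoulli hypotheses in the spirit of the tea tasting example; the sample and the two success rates (chance-level 1/2 versus above-chance 3/4) are hypothetical choices for illustration.

```python
from math import prod

def likelihood(theta, sample):
    """Probability of a particular sequence of guesses (1 = correct)
    under a Bernoulli hypothesis with success rate theta."""
    return prod(theta if x else 1 - theta for x in sample)

sample = [1, 1, 1, 1, 1, 0]  # five correct guesses and one failure

l_half = likelihood(0.5, sample)             # h0: guessing at chance
l_three_quarters = likelihood(0.75, sample)  # h1: above-chance ability

# The sample supports h1 over h0 exactly when its likelihood under h1
# exceeds that under h0; the ratio quantifies the comparative support.
print(l_three_quarters > l_half)   # True
print(l_three_quarters / l_half)   # ≈ 3.8
```

The likelihood ratio printed at the end is the same quantity that reappears below in the game-theoretic and Bayesian treatments of this example.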

The position of likelihoodism is based on a specific combination of views on probability. On the one hand, it only employs probabilities over sample space, and avoids putting probabilities over statistical hypotheses. It thereby avoids the use of probability that cannot be given an ontic interpretation. On the other hand, it does interpret the probabilities over sample space as components of a support relation, and thereby as pertaining to the epistemic rather than the physical realm. Notably, the likelihoodist approach fits well with a long history in formal approaches to epistemology, in particular with confirmation theory (see the entry on confirmation), in which the probability theory is used to spell out confirmation relations between data and hypotheses. Measures of confirmation invariably take the likelihoods of hypotheses as input components. They provide a quantitative expression of the support relations described by the law of likelihood.

3.3.2 Error statistics and severe testing

Another epistemic approach to classical statistics is presented by Mayo (1996, 2018) and Mayo and Spanos (2011). Over the past two decades, they have done much to push the agenda of classical statistics in the philosophy of science, which had become dominated by Bayesian statistics. Countering the original behaviourist tendencies of Neyman, the error statistical approach advances an epistemic reading of classical test and estimation procedures. Mayo and Spanos argue that classical procedures are best understood as inferential: they license inductive inferences. But they readily admit that the inferences are defeasible, i.e., they can lead us astray. Classical procedures are always associated with particular error probabilities, e.g., the probability of a false rejection or acceptance, or the probability of an estimator falling within a certain range. In the theory of Mayo and Spanos, these error probabilities obtain an epistemic role, because they are taken to indicate the reliability of the inferences licensed by the procedures.

The error statistical approach of Mayo and others comprises a general philosophy of science as well as a particular viewpoint on the philosophy of statistics. We briefly focus on the latter, through a discussion of the notion of a severe test (cf. Mayo and Spanos 2006). The claim is that we gain knowledge of experimental effects on the basis of, what Mayo and others call, severely testing hypotheses, a concept that can be characterized by reference to the significance and power of the statistical tests involved. In Mayo’s definition, a hypothesis passes a severe test on two conditions: the hypothesis must agree with the data, and with high probability, if the hypothesis were in fact false, then it would not agree with the data. Ignoring potential controversy over the precise interpretation of “agreeing with the data” and “low probability”, we can recognize the test characteristics of Neyman and Pearson in these conditions: a test can be called severe if the error rates are low. More precisely, we might say that an alternative hypothesis passes a severe test if the power is high, which somewhat resembles the condition that the alternative hypothesis agrees with the data: the data fall into a region with high probability according to the alternative hypothesis. Furthermore, the test is severe if the significance level of the test is low, which resembles the condition that if the alternative hypothesis is false, and hence some null hypothesis true, the probability that this null hypothesis agrees with the data is low, in the sense that the data fall into a region with low probability according to the null hypothesis.

There are, however, some differences between the criteria of Neyman and Pearson and the criterion of test severity that merit close attention. Importantly, Neyman and Pearson are concerned with test characteristics that derive from the probability assignments over sample space, as determined by the null and alternative hypotheses, and not with specific samples and specific instances of testing. The severity condition, by contrast, pertains to the particular sample obtained for a test: one condition for calling a test severe is that the alternative hypothesis must agree with the sample actually obtained. This condition does not exactly match the Neyman-Pearson criterion that the power is high, as the latter pertains to the probability of the entire region of rejection according to the alternative hypothesis. A similar remark applies to the second condition of severity, which is that if the alternative hypothesis is false, then with high probability the data would not agree with it. As suggested, this condition relates to the significance level of the test, but it is not captured adequately by requiring that the probability of the whole region of rejection is low according to the null hypothesis, because we want the null hypothesis to give low probability to the data actually obtained. For this reason we can say that the second condition of severity is loosely expressed in the p-value, i.e., the maximal significance level at which a test can reject the null hypothesis for a given sample.

The error statistical approach shows similarities to the likelihoodist approach, in that it focuses on the evidence presented by a sample. It thereby differs from earlier views on classical statistics that motivate the procedures by reference to the frequency of success in a series of repeated applications. However, error statistics also differs from the likelihoodist approach, especially in what may be called its falsificationist orientation. The evidence presented by a statistical test or an estimation is not merely comparative, and there is no presumption that one of the hypotheses under consideration is true, or adequate, as is sometimes claimed about Bayesian statistics (e.g., Dawid 1982). Instead the error statistical approach leaves open that the data agrees with none of the hypotheses, reflecting that the assumptions of the procedures may be falsified and are open to revision. This fundamental openness to revision has inspired scholars across the board (e.g., Gelman and Shalizi 2013).

3.3.3 Theoretical developments

Apart from re-interpretations of the classical statistical procedures, numerous statisticians and philosophers have developed the theory of classical statistics further in order to make good on the epistemic role of its results. We focus on four developments in particular, to wit, fiducial, evidential, and game-theoretic probability, and the use of e-values rather than p-values.

The theory of evidential probability originates in Kyburg (1961), who developed a logical system to deal consistently with the results of classical statistical analyses. Evidential probability thus falls within the attempts to establish the epistemic use of classical statistics. Haenni et al (2010) and Kyburg and Teng (2001) present insightful introductions to evidential probability. The system is based on a version of default reasoning: statistical hypotheses come attached with a confidence level, and logical rules organize how such confidence levels are propagated in inference, and thus advise which hypothesis to use for predictions and decisions. Particular attention is devoted to the propagation of confidence levels in inferences that involve multiple instances of the same hypothesis tagged with different confidences, where those confidences result from diverse data sets that are each associated with a particular population. Evidential probability assists in selecting the optimal confidence level, and thus in choosing the appropriate population for the case under consideration. In other words, evidential probability helps to resolve the reference class problem alluded to in the foregoing.

Fiducial probability presents another way in which classical statistics can be given an epistemic status. Fisher (1930, 1933, 1935c, 1956/1973) developed the notion of fiducial probability as a way of deriving a probability assignment over hypotheses without assuming a prior probability over statistical hypotheses at the outset. The fiducial argument is controversial, and it is generally agreed that its applicability is limited to particular statistical problems. Dempster (1964), Hacking (1965), Edwards (1972), Seidenfeld (1996) and Zabell (1996) provide insightful discussions. Seidenfeld (1979) presents a particularly detailed study and a further discussion of the restricted applicability of the argument in cases with multiple parameters. Dawid and Stone (1982) argue that in order to run the fiducial argument, one has to assume that the statistical problem can be captured in a functional model that is smoothly invertible. Dempster (1966) provides generalizations of this idea for cases in which the distribution over \(\theta\) is not fixed uniquely but only constrained within upper and lower bounds (cf. Hannig 2009, Haenni et al 2011). Crucially, such constraints on the probability distribution over values of \(\theta\) are obtained without assuming any distribution over \(\theta\) at the outset.

The idea of game-theoretic probability is a comparatively recent one in the development of classical statistics (Shafer and Vovk 2001 and 2019, Shafer 2021). The basic idea of this approach is to replace the categorical verdicts of failing and passing a statistical test by a gradual expression of agreement with the data in terms of bets and their payoffs. We start by viewing the distribution of the null hypothesis, \(P_{h}(S)\), as a collection of betting offers by a bookie. A gambler is allowed to choose any payoff function \(F(s) > 0\), which determines what they receive when observing \(s\). The fair price for the collection of bets according to the bookie, as encoded in the null hypothesis, then is the expectation value:

\[ E_{P_{h}}(F) = \sum_{s \in W} P_{h}(s)F(s) \]

Say that the gambler buys the collection of bets for the fair price according to the null, and that they subsequently observe \(s\) and receive the payoff \(F(s)\). If this payoff is larger than the fair price, this counts as evidence against the null.

We can give a further interpretation of this payoff function as a measure of evidence. Setting \(E_{P_{h}}(F) = 1\), we see that \(P_{h'}(s) = P_{h}(s) F(s)\) is another probability distribution over \(W\). We can associate this distribution with the alternative hypothesis \(h'\). This turns the payoff function into the likelihood ratio for the two hypotheses,

\[ F(s) = \frac{P_{h'}(s)}{P_{h}(s)}\,. \]

As Shafer (2021) argues, the likelihood ratio expresses the payoff function that a gambler who adopts the alternative hypothesis \(h'\) might choose, when buying a collection of bets from a bookie who adopts the null hypothesis \(h\). The reason is that this optimizes the growth rate of the betting revenues, and thus maximizes a certain conception of epistemic gain. The likelihood ratio thus offers a natural expression of how strong the evidence is. However, for a betting conception of testing that allows us to express evidential strength, we need not necessarily adopt any hypothesis as the alternative, nor do we need to fix a payoff function in any particular way in order to reap the conceptual benefits. The betting understanding of testing is clearly classical in that it avoids an epistemic interpretation of the procedures or the payoffs, suggesting a decision- or game-theoretic interpretation instead. But it offers major advantages over traditional conceptions of testing, especially because it facilitates the build-up of statistical results across studies, where each study may choose its own payoff function.
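The betting picture can be made concrete with a small computation. The sketch below assumes a hypothetical Bernoulli setup in the spirit of the tea tasting example (null success rate 1/2, alternative 3/4, sequences of six guesses) and checks that the likelihood-ratio payoff has fair price 1 under the null.

```python
from itertools import product as sequences
from math import prod

# Sample space W: all sequences of six guesses (1 = correct guess).
W = list(sequences([0, 1], repeat=6))

def p(theta, s):
    """Probability of the particular sequence s under success rate theta."""
    return prod(theta if x else 1 - theta for x in s)

def F(s):
    """Payoff function chosen as the likelihood ratio P_h'(s) / P_h(s)."""
    return p(0.75, s) / p(0.5, s)

# The fair price of the payoff F under the null is its expectation,
# which equals 1 for this choice of F.
fair_price = sum(p(0.5, s) * F(s) for s in W)

s_observed = (1, 1, 1, 1, 1, 0)  # five successes, one failure
payoff = F(s_observed)           # ≈ 3.8: larger than the fair price,
                                 # so it counts as evidence against the null
print(fair_price, payoff)
```

The fair price is 1 because summing \(P_{h}(s) F(s)\) over the sample space just sums the alternative distribution \(P_{h'}\), which totals 1; any observed payoff well above 1 then tells against the null.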

The desideratum that we can combine and reuse statistical results in new contexts is a leading motivation for a related and very recent development within the classical statistical approach, namely the expression of the evidential impact of the data on statistical hypotheses in terms of so-called e-values (Grunwald 2023, 2024). Following the decision-theoretical outlook of Wald (1939), a central notion in this development is the loss \(L\) that is associated with hypotheses, denoted \(h\), and actions, labelled by numbers \(a\), for instance to reject or accept some hypothesis and take further action on that basis. Such a loss function helps us to weigh up type-I and type-II errors against each other, and eventually determines the decision to reject the null hypothesis and accept the alternative, or otherwise, depending on the data obtained.

E-values offer an alternative to the p-values that are often reported alongside a decision to reject the null hypothesis. The p-values indicate that we could have rejected the null hypothesis with a lower type-I error, and often serve as an expression of the strength of evidence against the null. But e-values have several properties that make them more attractive than p-values. For one, e-values are defined for collections of distributions rather than single ones, thus leaving room for composite hypotheses. Most importantly, e-values retain their evidential meaning under post-hoc changes to the loss function \(L(h, a)\), including changes in the number of available actions and in the set of hypotheses under consideration. This is particularly relevant to cases in which, because of the data obtained, we wish to reconsider our statistical analysis and our decision context, or in which the data affect what actions are considered in the first place.

Crucial in the approach is the so-called e-variable, denoted \(G\), defined by the requirement that \(E_{P_{h}}(G) \leq 1\), where \(h\) is the null hypothesis. The e-variable is comparable to the payoff function \(F\) in the game-theoretical approach to classical statistics discussed above. Assuming that the actions \(a\) are ordered so that the loss function \(L(h, a)\) increases monotonically in \(a\), while \(L(h', a)\) decreases monotonically in \(a\), the decision rule is that we choose the action \(a\) with maximal value for which

\[ L( h, a) \lt G(s) r , \]

in which \(G(s)\) is the e-value of the observed sample \(s\) and \(r\) a level of maximally acceptable risk. A large e-value \(G(s)\) expresses that the sample \(s\) presents strong evidence against the null, so that a more extreme action can be chosen, because we are more certain of the falsity of the null hypothesis.

For comparing two simple hypotheses, a natural choice for the e-variable is the likelihood ratio, i.e., \(G(s) = F(s)\), so that we select more extreme actions when our data presents stronger evidence. But the concept of an e-variable is much more general. We can reproduce the standard Neyman-Pearson test in this framework, and expand to other, more sophisticated test functions by choosing other e-variables. And we can define the notion of an e-posterior, replacing the confidence intervals detailed above by a notion that stands on firmer decision-theoretic grounds. However, while these are exciting new developments, the theory of e-values and e-posteriors has not yet achieved maturity and its applicability in practice has yet to be determined.
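As a minimal illustration of the decision rule, the sketch below uses the likelihood ratio from the tea tasting setup as e-variable; the loss values and the risk level \(r\) are invented for illustration and are not part of any published example.

```python
from math import prod

def p(theta, s):
    """Probability of the particular sequence s under success rate theta."""
    return prod(theta if x else 1 - theta for x in s)

def G(s):
    """E-variable: likelihood ratio of the alternative (3/4) to the null (1/2)."""
    return p(0.75, s) / p(0.5, s)

# Losses L(h, a) when the null h is in fact true, increasing in a:
# more extreme actions are costlier if the null holds. Values hypothetical.
L_h = {0: 0.0, 1: 1.0, 2: 5.0}
r = 1.0  # maximally acceptable risk (hypothetical)

def decide(s):
    """Choose the action a with maximal value for which L(h, a) < G(s) * r."""
    return max(a for a, loss in L_h.items() if loss < G(s) * r)

print(decide((1, 1, 1, 1, 1, 0)))  # G ≈ 3.8, strong evidence: action 1
print(decide((1, 0, 1, 0, 1, 0)))  # G < 1, weak evidence: stay with action 0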

3.3.4 Excursion: the fiducial argument

To explain the fiducial argument we first set up a simple example. Say that we estimate the mean \(\theta\) of a normal distribution with unit variance over a variable \(X\). We collect a sample \(s\) consisting of measurements \(X_{1}, X_{2}, \ldots, X_{n}\). The maximum likelihood estimator for \(\theta\) is the average value of the \(X_{i}\), that is, \(\hat{\theta}(s) = \sum_{i} X_{i} / n\). Under an assumed true value \(\theta^{\star}\) we then have a normal distribution over the values of the estimator \(\hat{\theta}(s)\), centred on the true value and with variance \(1/n\). Notably, this distribution has the same shape for all values of \(\theta^{\star}\). Fisher argued that we can therefore use the distribution for the estimator \(\hat{\theta}(s)\) given the sample \(s\) as a stand-in for the distribution over the true value \(\theta^{\star}\), and derive a probability distribution \(P_{\text{Fid}}(\theta^{\star})\) on the basis of a sample \(s\), seemingly without assuming a prior probability.

There are several ways to clarify this so-called fiducial argument. One way employs a so-called functional model, i.e., the specification of a statistical model by means of a particular function relating statistical parameters and samples. For the above model, the function is

\[ f(\theta, \epsilon) = \theta + \epsilon = \hat{\theta}(s). \]

It relates possible parameter values \(\theta\) to a quantity based on the sample, in this case the estimator of the observations \(\hat{\theta}\). The two are related through a stochastic component \(\epsilon\) whose distribution is known. In the example case, it is a Gaussian with a mean at \(0\) and with variance \(1/n\). Importantly, the distribution of \(\epsilon\) is the same for every value of \(\theta\). The interpretation of the function \(f\) may now be apparent: relative to the choice of a value of \(\theta\), which then takes the role of the true value \(\theta^{\star}\), the distribution over \(\epsilon\) dictates the distribution over the estimator function \(\hat{\theta}(s)\).

The idea of the fiducial argument is to project the distribution over the stochastic component \(\epsilon\) back onto the possible parameter values \(\theta\). The key observation is that the functional relation \(f(\theta, \epsilon)\) is smoothly invertible, i.e., the function

\[ f^{-1}(\hat{\theta}(s), \epsilon) = \hat{\theta}(s) - \epsilon = \theta \]

maps each combination of \(\hat{\theta}(s)\) and \(\epsilon\) to a unique parameter value \(\theta\). Hence, we can invert the claim of the previous paragraph: relative to fixing a value for \(\hat{\theta}\), the distribution over \(\epsilon\) fully determines the distribution over \(\theta\). In virtue of the inverted functional model, we can therefore transfer the normal distribution over \(\epsilon\) to the values \(\theta\) around \(\hat{\theta}(s)\). This yields a so-called fiducial probability distribution over the parameter \(\theta\), denoted \(P_{\text{Fid}}\). The distribution is obtained because, conditional on the value of the estimator, the parameters and the stochastic terms become perfectly correlated. A distribution over the latter is then automatically applicable to the former (cf. Haenni et al, 52–55 and 119–122).

Another way of explaining the same idea invokes the notion of a pivotal quantity. Because of how the above statistical model is set up, we can construct the pivotal quantity \(\hat{\theta}(s) - \theta\). We know the distribution of this quantity since it is the distribution of \(\epsilon\), namely normal and with variance \(1/n\). Moreover, this distribution is independent of the parameter \(\theta\), and it is such that, after observing the sample and thus fixing the value of \(\hat{\theta}(s)\), it uniquely determines a distribution over the parameter values \(\theta\). In sum, the fiducial argument allows us to construct a probability distribution over the parameter values on the basis of the observed sample. The argument can be run whenever we can construct a pivotal quantity or, equivalently, whenever we can express the statistical model as a functional model.
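A short simulation can make the pivotal property vivid: whatever true mean we feed in, the quantity \(\hat{\theta}(s) - \theta\) has (approximately, over repeated samples) mean \(0\) and variance \(1/n\). The sample size, repetition count, and true means below are hypothetical.

```python
import random
from statistics import mean, variance

random.seed(0)
n = 25  # sample size, so the pivot should have variance 1/n = 0.04

def pivot_draws(true_theta, reps=10000):
    """Simulate theta_hat - theta for repeated samples from N(true_theta, 1)."""
    draws = []
    for _ in range(reps):
        sample = [random.gauss(true_theta, 1) for _ in range(n)]
        theta_hat = sum(sample) / n  # maximum likelihood estimator
        draws.append(theta_hat - true_theta)
    return draws

# The summary statistics barely change with the true mean: the pivot's
# distribution does not depend on the parameter.
for theta_star in (0.0, 2.0, -1.5):
    d = pivot_draws(theta_star)
    print(theta_star, round(mean(d), 3), round(variance(d), 4))
```

It is exactly this parameter-independence that licenses the fiducial inversion: the known distribution of the pivot can be read as a distribution over \(\theta\) once \(\hat{\theta}(s)\) is fixed.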

In order to properly appreciate the precise inferential move and its wobbly conceptual basis, it will be instructive to consider the use of fiducial probability in interpreting confidence intervals. As explained in Section 3.1.3, such intervals indicate the quality of an estimation but they are often mistakenly interpreted epistemically: the 95% confidence interval is often misunderstood as the range of parameter values that includes the true value with 95% probability, a so-called 95% credal interval:

\[ P(\theta^{\star} \in [\hat{\theta} - \Delta, \hat{\theta} + \Delta]) = 0.95. \]

The idea to assign probabilities to possible parameter values is in direct conflict with classical statistics. But the interpretation can be motivated by an application of the fiducial argument. Once we have computed the distribution \(P_{\text{Fid}}(\theta)\), it becomes possible to express a fiducial interval with it, and determine the value of \(\Delta\) such that \(P_{\text{Fid}}(\theta^{\star} \in [\hat{\theta} - \Delta, \hat{\theta} + \Delta])\) adds up to a chosen confidence level \(1 - \alpha\). For the example above and for others amenable to the fiducial argument, the fiducial interval coincides numerically with the confidence interval, which might explain why the misinterpretation of confidence intervals is so pervasive.
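The numerical coincidence can be checked directly. The sketch below, with a hypothetical estimator value and sample size, computes the classical 95% confidence interval and the central 95% region of the fiducial distribution for the normal-mean example.

```python
from statistics import NormalDist

# Hypothetical numbers: estimator value theta_hat from some observed
# sample of size n, for the normal-mean example with unit variance.
n = 25
theta_hat = 0.31
sigma = (1 / n) ** 0.5  # spread of the estimator

# Classical 95% confidence interval around the estimator.
z = NormalDist().inv_cdf(0.975)  # two-sided 95% quantile, about 1.96
delta = z * sigma
ci = (theta_hat - delta, theta_hat + delta)

# Fiducial interval: central 95% region of the fiducial distribution,
# a normal centred on theta_hat with variance 1/n.
fid = NormalDist(mu=theta_hat, sigma=sigma)
fid_interval = (fid.inv_cdf(0.025), fid.inv_cdf(0.975))

print(ci)
print(fid_interval)  # numerically the same interval
```

The two intervals agree because both are built from the same quantiles of the same normal shape; they differ only in what the probabilities are taken to be about.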

A warning is in order: the fiducial argument is controversial and its proper interpretation is a matter of debate. The probabilities appearing in classical statistical methods are normally interpreted as frequencies of events, offering guarantees of low error rates when methods are applied repeatedly, while the probability distribution over hypotheses that is generated by a fiducial argument carries an epistemic interpretation. But it is not clear that we can take the distribution \(P_{\text{Fid}}\) as an expression of our beliefs, nor that we may support the epistemic interpretation of confidence intervals with it. For this reason fiducial probability is perhaps best understood as a half-way house between the classical and the Bayesian view on statistics. Several authors (e.g., Dempster 1964) have noted that fiducial probability indeed makes most sense from a Bayesian perspective. It is to this perspective that we now turn.

4. Bayesian statistics

Bayesian statistical methods are often presented in the form of an inference. The inference runs from a so-called prior probability distribution over statistical hypotheses, which expresses the degree of belief in the hypotheses before data has been collected, to a posterior probability distribution over the hypotheses, which expresses the beliefs after the data have been incorporated. The posterior distribution follows, via the axioms of probability theory, from the prior distribution and the likelihoods of the hypotheses for the data obtained, i.e., the probability that the hypotheses assign to the data. Bayesian methods thus employ data to modulate our attitude towards a designated set of statistical hypotheses. Viewed abstractly, both classical and Bayesian statistics present a response to the problem of induction. But whereas classical procedures select or eliminate elements from the set of hypotheses, Bayesian methods express the impact of data in a posterior probability assignment over the set.

The defining characteristic of Bayesian statistics is that it considers probability distributions over statistical hypotheses as well as over data. It thereby embraces the epistemic interpretation of probability: probabilities over hypotheses are interpreted as degrees of belief, i.e., as expressions of epistemic uncertainty. The philosophy of Bayesian statistics is concerned with determining the appropriate interpretation of these input components, and of the mathematical formalism of probability itself, ultimately with the aim to justify the output. Notice that the general pattern of a Bayesian statistical method is that of inductivism in the cumulative sense: under the impact of data we move to more and more informed probabilistic opinions about the hypotheses. However, in the following it will appear that Bayesian methods may also be understood as deductivist in nature.

4.1 Basic pattern of inference

Bayesian inference always starts from a statistical model, i.e., a set of statistical hypotheses. While the general pattern of inference is the same, we treat models with a finite number and a continuum of hypotheses separately and draw parallels with hypothesis testing and estimation, respectively. The exposition is mostly based on Earman 1992, Press 2002, Howson and Urbach 2006, and Gelman et al 2013.

4.1.1 Finite model

Central to Bayesian methods is a theorem from probability theory known as Bayes’ theorem. Relative to a prior probability distribution over hypotheses, and the probability distributions over sample space for each hypothesis, it tells us what the adequate posterior probability over hypotheses is. More precisely, let \(s\) be the sample and \(S\) be the sample space as before, and let \(M = \{h_{\theta} : \theta \in \Theta \}\) be the model, i.e., the space of statistical hypotheses, with \(\Theta\) the space of parameter values. The function \(P\) is a probability distribution over the entire space \(M \times S\), meaning that every element \(h_{\theta}\) is associated with its own sample space \(S\), and its own probability distribution over that space. For the latter, which is fully determined by the likelihoods of the hypotheses, we write the probability of the sample conditional on the hypothesis, \(P(s \mid h_{\theta})\). This differs from the expression \(P_{h_{\theta}}(s)\), written in the context of classical statistics, because in contrast to classical statisticians, Bayesians accept \(h_{\theta}\) as an argument for the probability distribution.

Bayesian statistics is first introduced in the context of a finite set of hypotheses, after which a generalization to the infinite case is provided. Assume the prior probability \(P(h_{\theta})\) over the hypotheses \(h_{\theta} \in M\). Further assume the likelihoods \(P(s \mid h_{\theta})\), i.e., the probability assigned to the data \(s\) conditional on the hypotheses \(h_{\theta}\). Then Bayes’ theorem determines that

\[ P(h_{\theta} \mid s) \; = \; \frac{P(s \mid h_{\theta})}{P(s)} P(h_{\theta}) . \]

Bayesian statistics outputs the posterior probability assignment, \(P(h_{\theta} \mid s)\). This expression gets the interpretation of an opinion concerning \(h_{\theta}\) after the sample \(s\) has been accommodated, i.e., it is a revised opinion. Further results from a Bayesian inference can all be derived from the posterior distribution over the statistical hypotheses. For instance, we can use the posterior to determine the most probable value for the parameter, i.e., picking the hypothesis \(h_{\theta}\) for which \(P(h_{\theta} \mid s)\) is maximal.

In this characterization of Bayesian statistical inference the probability of the data \(P(s)\) is not presupposed, because it can be computed from the prior and the likelihoods by the law of total probability,

\[ P(s) \; = \; \sum_{\theta \in \Theta} P(h_{\theta}) P(s \mid h_{\theta}) . \]

This expression is often called the marginal likelihood of the model: it expresses how probable the data is in the light of the model as a whole. The result of a Bayesian statistical inference is not always reported as a posterior probability. Often the interest is only in comparing the ratio of the posteriors of two hypotheses. By Bayes’ theorem we have

\[ \frac{P(h_{\theta} \mid s)}{P(h_{\theta'} \mid s)} \; = \; \frac{P(h_{\theta}) P(s \mid h_{\theta})}{P(h_{\theta'}) P(s \mid h_{\theta'})} , \]

and if we assume equal priors \(P(h_{\theta}) = P(h_{\theta'})\), we can use the ratio of the likelihoods of the hypotheses, the so-called Bayes factor, to compare the hypotheses.

Here is a Bayesian procedure for the example of the tea tasting student.

Bayesian statistical analysis
In the tea tasting example, consider the hypotheses \(h_{1/2}\) and \(h_{3/4}\), which in the foregoing were used as null and alternative, \(h\) and \(h'\), respectively. Instead of choosing among them on the basis of the data, we assign a prior distribution over them so that the null is twice as probable as the alternative: \(P(h_{1/2}) = 2/3\) and \(P(h_{3/4}) = 1/3\). Denoting a particular sequence of guessing \(n\) out of 5 cups correctly with \(s_{n/5}\), we have that \(P(s_{n/5} \mid h_{1/2}) = 1 / 2^{5}\) while \(P(s_{n/5} \mid h_{3/4}) = 3^{n} / 4^{5}\). As before, the likelihood ratio of five guesses thus becomes

\[ \frac{P(s_{n/5} \mid h_{3/4})}{P(s_{n/5} \mid h_{1/2})} \; = \; \frac{3^{n}}{2^{5}} . \]

The posterior ratio after 5 correct guesses is thus

\[ \frac{P(h_{3/4} \mid s_{n/5})}{P(h_{1/2} \mid s_{n/5})} \; = \; \frac{3^{5}}{2^{5}}\, \frac{1}{2} \approx 3.8 . \]

This posterior is derived by the axioms of probability theory alone, in particular by Bayes’ theorem. It tells us how believable each of the hypotheses is after incorporating the sample data into our beliefs.
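The worked numbers above can be reproduced in a few lines of code (the hypothesis labels are ours, chosen for readability):

```python
# Prior 2/3 on the null h_{1/2} and 1/3 on the alternative h_{3/4},
# and a particular sequence of five correct guesses (n = 5).
prior = {"h_half": 2 / 3, "h_three_quarters": 1 / 3}
likelihood = {"h_half": (1 / 2) ** 5,
              "h_three_quarters": (3 / 4) ** 5}

# Marginal likelihood P(s), by the law of total probability.
p_s = sum(prior[h] * likelihood[h] for h in prior)

# Bayes' theorem for each hypothesis.
posterior = {h: prior[h] * likelihood[h] / p_s for h in prior}

ratio = posterior["h_three_quarters"] / posterior["h_half"]
print(ratio)  # (3^5 / 2^5) * (1/2) ≈ 3.8
```

Note that the marginal likelihood cancels in the ratio, which is why the posterior ratio is just the likelihood ratio times the prior ratio.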

Notice that in the above exposition, the posterior probability is written as \(P(h_{\theta} \mid s_{n/5})\). Some expositions of Bayesian inference prefer to express the revised opinion as a new probability function \(P_{s}( \cdot )\), which is then equated to the old probability conditional on the sample, \(P( \cdot \mid s)\). For the basic formal workings of Bayesian inference, this distinction is inessential. But we will return to it in Section 4.3.3.

4.1.2 Continuous model

In many applications the model is not a finite set of hypotheses, but rather a continuum labelled by a real-valued parameter. This leads to some subtle changes in the definition of the distribution over hypotheses and the likelihoods. The prior and posterior must be written down as a so-called probability density function, denoted with the lowercase \(p(h_{\theta})\), such that for any set of hypotheses \(M\) we can write

\[P(M) = \int_{M} p(h_{\theta}) d\theta , \]

where \(P\) is again an ordinary probability function. Accordingly, \(P(h_{\theta}) = p(h_{\theta}) d\theta\) is the infinitely small probability assigned to an infinitely small patch \(d\theta\) around the point \(\theta\).

The likelihoods need to be defined by a limit process: the probability \(P(h_{\theta})\) is infinitely small so that we cannot define \(P(s \mid h_{\theta})\) in the normal manner. But other than that the Bayesian machinery works exactly the same:

\[ P(h_{\theta} \mid s) d\theta \;\; = \;\; \frac{P(s \mid h_{\theta})}{P(s)} P(h_{\theta}) d\theta. \]

Finally, summations need to be replaced by integrations:

\[ P(s) \; = \; \int_{\theta \in \Theta} p(h_{\theta}) P(s \mid h_{\theta}) d\theta . \]

This is again the marginal likelihood of the model, computed by the law of total probability.

The posterior probability density provides a basis for conclusions that one might draw from the sample \(s\), and which are similar to estimations and measures for the accuracy of the estimations. For one, we can derive an expectation for the parameter \(\theta\), where we assume that \(\theta\) varies continuously:

\[ \bar{\theta} \;\;=\;\; \int_{\Theta}\, \theta\, p(h_{\theta} \mid s)\, d\theta. \]

If the model is parameterized by a convex set, which it typically is, then there will be a hypothesis \(h_{\bar{\theta}}\) in the model. This hypothesis can serve as a Bayesian estimation. An alternative notion of estimation uses the mode of the posterior distribution, i.e., the value of \(\theta\) where the posterior is maximal. In analogy to the confidence interval, we can also define the credal interval or credibility interval from the posterior probability distribution: an interval of size \(2d\) around the expectation value \(\bar{\theta}\), written \([\bar{\theta} - d, \bar{\theta} + d]\), such that

\[ \int_{\bar{\theta} - d}^{\bar{\theta} + d} p(h_{\theta} \mid s)\, d\theta = 1-\epsilon . \]

This range of values for \(\theta\) is such that the posterior probability of the corresponding \(h_{\theta}\) adds up to \(1-\epsilon\) of the total posterior probability.

There are many other ways of defining Bayesian estimations and credal intervals for \(\theta\) on the basis of the posterior density. The specific type of estimation that the Bayesian analysis offers can be determined by the demands of the scientist. Any Bayesian estimation will to some extent resemble the maximum likelihood estimator, due to the central role of the likelihoods in the Bayesian formalism. However, the output will also depend on the prior probability over the hypotheses, and generally speaking it will only tend to the maximum likelihood estimator when the sample size tends to infinity. See Section 4.2.2 for more on this so-called “washing out” of the priors.
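These quantities can be illustrated numerically. The following sketch (not part of the original entry; the uniform prior and the hypothetical sample of 7 ones in 10 tosses are arbitrary choices) computes the posterior expectation, the posterior mode, and a central credal interval for a Bernoulli model on a discretized parameter grid.

```python
import numpy as np

# Discretized Bernoulli model: hypotheses h_theta for theta in [0, 1].
dtheta = 0.001
theta = np.arange(dtheta / 2, 1.0, dtheta)   # midpoints of 1000 grid cells
prior = np.ones_like(theta)                  # uniform prior density p(h_theta)

# Hypothetical sample s: 7 ones in 10 trials; likelihood P(s | h_theta)
# up to a constant factor.
n, t = 7, 10
lik = theta**n * (1 - theta)**(t - n)

# Bayes' theorem on the grid; dividing by the total sum plays the role
# of the marginal likelihood P(s).
post = lik * prior
post /= (post * dtheta).sum()

# Two Bayesian estimates: the posterior expectation and the posterior mode.
mean = (theta * post * dtheta).sum()         # analytically (n+1)/(t+2) = 2/3
mode = theta[np.argmax(post)]                # analytically n/t = 0.7

# Central credal interval [mean - d, mean + d] of posterior mass 1 - eps.
eps, d = 0.05, 0.0
while (post[(theta >= mean - d) & (theta <= mean + d)] * dtheta).sum() < 1 - eps:
    d += dtheta
```

With a uniform prior, the posterior mean differs from the maximum likelihood estimate \(7/10\), in line with the remark above that the two only coincide in the limit of large samples.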

4.2 Problems with the Bayesian approach

Most of the controversy over the Bayesian method concerns the probability assignment over hypotheses. One important set of problems surrounds the interpretation of those probabilities as beliefs, as having to do with a willingness to act, or the like. Another set of problems pertains to the determination of the prior probability assignment, and the criteria that might govern it.

4.2.1 Interpretations of the probability over hypotheses

The overall question here is how we should understand the probability assigned to a statistical hypothesis. Naturally the interpretation will be epistemic: the probability expresses the strength of belief in the hypothesis. It makes little sense to attempt a physical interpretation, since the hypothesis cannot be seen as a repeatable event, or as an event that might have some tendency to occur.

This leaves open several interpretations of the probability assignment as a strength of belief. One very influential interpretation of probability as degree of belief relates probability to a willingness to bet against certain odds (cf. Ramsey 1926, De Finetti 1937/1964, Earman 1992, Jeffrey 1992, Howson 2000). According to this interpretation, assigning a probability of \(3/4\) to a proposition, for example, means that we are prepared to pay at most $0.75 for a betting contract that pays out $1 if the proposition is true, and that is worthless if the proposition is false. The claim that degrees of belief are correctly expressed in a probability assignment is then supported by a so-called Dutch book argument: if an agent does not comply with the axioms of probability theory, a malign bookmaker can propose a set of bets that seems fair to the agent but that leads to a certain monetary loss, and that is therefore called Dutch, presumably owing to the mercantile reputation of the Dutch. This interpretation associates beliefs directly with their behavioral consequences: believing something is the same as having the willingness to engage in a particular activity, e.g., in a bet.

There are several problems with this interpretation of the probability assignment over hypotheses. For one, it seems to make little sense to bet on the truth of a statistical hypothesis, because such hypotheses cannot be falsified or verified. Consequently, a betting contract on them will never be cashed. More generally, it is not clear that beliefs about statistical hypotheses are properly framed by connecting them to behavior (cf. Armendt 1993). This way of framing probability assignments introduces pragmatic considerations on beliefs, to do with navigating the world successfully, into a setting that may be more concerned with belief as a truthful representation of the world.

A somewhat different problem is that the Bayesian formalism, in particular its use of probability assignments over statistical hypotheses, suggests a remarkable closed-mindedness on the part of the Bayesian statistician. Recall the earlier example, with the model \(M = \{ h_{1/2}, h_{3/4} \}\). The Bayesian formalism requires that we assign a probability distribution over these two hypotheses, and further that the probability of the model is \(P(M) = 1\). It is quite a strong assumption, even of an ideally rational agent, that she is indeed equipped with a real-valued function that expresses her opinion over the hypotheses. Moreover, the probability assignment over hypotheses seems to entail that the Bayesian statistician is certain that the true hypothesis is included in the model. This is an unduly strong claim to which a Bayesian statistician will have to commit at the start of her analysis. It sits badly with broadly shared methodological insights (e.g., Popper 1934/1956), according to which scientific theory must be open to revision at all times (cf. Mayo 1996). In its standard form, Bayesian statistics does not do justice to the nature of scientific inquiry.

The problem just outlined obtains a mathematically more sophisticated form in the problem that Bayesians expect themselves to be well calibrated. This problem, as formulated in Dawid (1982), concerns a Bayesian forecaster, e.g., a weatherman who determines a daily probability of precipitation for the next day. It is then shown that such a weatherman believes of himself that in the long run he will converge on the correct probability with certainty. Yet it seems reasonable to suppose that the weatherman realizes something could potentially be wrong with his meteorological model, and so sets his probability for correct prediction below 1. The weatherman is thus led to incoherent beliefs. It seems that Bayesian statistical analysis places unrealistic demands, even on an ideal agent.

4.2.2 Determination of the prior

Assuming that we have settled on a statistical model and that we can interpret the probability over it as an expression of epistemic uncertainty, how do we determine a prior probability? Perhaps we already have an intuitive judgment on the hypotheses in the model, so that we can pin down the prior probability on that basis. Or else we might have additional criteria for choosing our prior. However, several problems attach to procedures for determining the prior.

First consider the idea that the scientist who runs the Bayesian analysis provides the prior probability herself. One obvious problem with this idea is that the opinion of the scientist might not be precise enough for the determination of a full prior distribution. It does not seem realistic to suppose that the scientist can transform her opinion into a single real-valued function over the model, especially not if the model itself consists of a continuum of hypotheses. But the more pressing problem is that different scientists will provide different prior distributions, and that these different priors will lead to different statistical results. In other words, Bayesian statistical inference introduces an inevitable subjective component into scientific method.

It is one thing that the statistical results depend on the initial opinion of the scientist. But it may so happen that the scientist has no opinion whatsoever about the hypotheses. How is she supposed to assign a prior probability to the hypotheses then? The prior will have to express her ignorance concerning the hypotheses. The leading idea in expressing such ignorance is usually the principle of indifference: ignorance means that we are indifferent between any pair of hypotheses. For a finite number of hypotheses, indifference means that every hypothesis gets equal probability. For a continuum of hypotheses, indifference means that the probability density function must be uniform.

Nevertheless, there are different ways of applying the principle of indifference, and so there are different probability distributions over the hypotheses that can count as expressions of ignorance. This insight is nicely illustrated by Bertrand’s paradox.

Bertrand’s paradox
Consider a circle drawn through the corners of an equilateral triangle, and imagine that a knitting needle whose length exceeds the circle’s diameter is thrown onto the circle. What is the probability that the section of the needle lying within the circle is longer than the side of the equilateral triangle? To determine the answer, we need to parameterize the ways in which the needle may be thrown, determine the subset of parameter values for which the included section is indeed longer than the triangle’s side, and express our ignorance over the exact throw of the needle in a probability distribution over the parameter, so that the probability of the said event can be derived. The problem is that we may provide any number of ways to parameterize how the needle lands in the circle. If we use the angle that the needle makes with the tangent of the circle at the intersection with one of the triangle’s corners, then the included section of the needle is only going to be longer if the angle is between \(60^{\circ}\) and \(120^{\circ}\). If we assume that our ignorance is expressed by a uniform distribution over these angles, which ranges from \(0^{\circ}\) to \(180^{\circ}\), then the probability of the event is going to be \(1/3\). However, we can also parameterize the ways in which the needle lands differently, namely by the shortest distance of the needle to the centre of the circle. A uniform probability over the distances will lead to a probability of \(1/2\). Finally, parameterizing the needle’s position by the location of its midpoint will lead us to assign the same event a probability of \(1/4\).
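The three answers can be checked with a small Monte Carlo simulation (an illustration not in the original entry): each of the three parameterizations is sampled uniformly, and the long-run frequencies approach \(1/3\), \(1/2\), and \(1/4\) respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 100_000, 1.0
side = np.sqrt(3) * R            # side of the inscribed equilateral triangle

# (1) Uniform angle with the tangent: the chord has length 2R sin(angle).
angle = rng.uniform(0, np.pi, N)
p_angle = np.mean(2 * R * np.sin(angle) > side)          # tends to 1/3

# (2) Uniform distance of the chord to the centre: length 2 sqrt(R^2 - r^2).
r = rng.uniform(0, R, N)
p_radius = np.mean(2 * np.sqrt(R**2 - r**2) > side)      # tends to 1/2

# (3) Chord midpoint uniform in the disk (via rejection sampling):
#     the chord is longer iff the midpoint lies within R/2 of the centre.
x = rng.uniform(-R, R, (N, 2))
m = x[x[:, 0]**2 + x[:, 1]**2 < R**2]
p_mid = np.mean(m[:, 0]**2 + m[:, 1]**2 < (R / 2)**2)    # tends to 1/4
```

The code makes the point of the paradox tangible: each block samples "at random", yet the three notions of randomness answer the same question differently.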

Jaynes (1973 and 2003) provides a very insightful discussion of this riddle and also argues that it may be resolved by relying on invariances of the problem under certain transformations. But the general message for now is that the principle of indifference does not automatically lead to a unique choice of priors. The point is not that ignorance concerning a parameter is hard to express in a probability distribution over its values. It is rather that in some cases, we do not even know which parameters to use to express our ignorance.

In part the problem of the subjectivity of Bayesian analysis may be resolved by taking a different attitude toward scientific theory, and by giving up the ideal of absolute objectivity. Indeed, some will argue that it is just right that the statistical methods accommodate differences of opinion among scientists. However, this response misses the mark if the prior distribution expresses ignorance rather than opinion: it seems harder to defend the rationality of differences of opinion that stem from different ways of spelling out ignorance. Now there is also a more positive answer to worries over objectivity, based on so-called convergence results (e.g., Blackwell and Dubins 1962, Gaifman and Snir 1982). It turns out that the impact of the choice of prior diminishes with the accumulation of data, and that in the limit the posterior distribution will converge to a set, possibly a singleton, of best hypotheses, determined by the sampled data and hence completely independent of the prior distribution. However, in the short and medium run the influence of the subjective choice of prior remains.
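A minimal numerical sketch of this washing out of the prior (the Beta priors, the coin bias, and the sample sizes are hypothetical choices, not from the text): two sharply different conjugate priors over a coin's bias are updated on the same sample, and the gap between their posterior means shrinks as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random(2000) < 0.6        # 2000 tosses of a coin with bias 0.6

def posterior_mean(a, b, sample):
    # Beta(a, b) prior on the Bernoulli bias; conjugate updating gives
    # posterior mean (a + number of ones) / (a + b + number of tosses).
    return (a + sample.sum()) / (a + b + len(sample))

# Two sharply different priors: optimistic Beta(8, 2) vs pessimistic Beta(2, 8).
gap_small = abs(posterior_mean(8, 2, data[:10]) - posterior_mean(2, 8, data[:10]))
gap_large = abs(posterior_mean(8, 2, data) - posterior_mean(2, 8, data))
```

After 10 tosses the two posterior means still differ by \(6/20 = 0.3\); after 2000 tosses the difference has shrunk to \(6/2010 \approx 0.003\), illustrating convergence in the limit alongside a persistent short-run influence of the prior.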

For better or worse, Bayesian statistics is sensitive to subjective input. The undeniable advantage of the classical statistical procedures is that they do not need any such input, although arguably the classical procedures are in turn sensitive to choices concerning the sample space (Lindley 2000). Against this, Bayesian statisticians point to the advantage of being able to incorporate evidentially relevant initial opinions into the statistical analysis.

4.3 Responses to criticism

The philosophy of Bayesian statistics offers a wide range of responses to the problems outlined above. Some Bayesians bite the bullet and defend the essentially subjective character of Bayesian methods. Others attempt to remedy or compensate for the subjectivity, by providing objectively motivated means of determining the prior probability or by emphasizing the objective character of the Bayesian formalism itself.

4.3.1 Strict but empirically informed subjectivism

One very influential view on Bayesian statistics buys into the subjectivity of the analysis (e.g., Goldstein 2006, Kadane 2011). So-called personalists or strict subjectivists argue that it is just right that the statistical methods do not provide any objective guidelines, pointing to radically subjective sources of any form of knowledge. The problems concerning the interpretation and choice of the prior distribution are thus dissolved, at least in part: the Bayesian statistician may choose her prior at will, as an expression of her beliefs. However, it deserves emphasis that a subjectivist view on Bayesian statistics does not mean that all constraints deriving from empirical fact can be disregarded. Nobody denies that if you have further knowledge that imposes constraints on the model or the prior, then those constraints must be accommodated. For example, today’s posterior probability may be used as tomorrow’s prior, in the next statistical inference. The point is that such constraints concern the rationality of belief and not the consistency of the statistical inference per se.

Subjectivist views are most prominent among those who interpret probability assignments in a pragmatic fashion, and motivate the representation of belief with probability assignments by the afore-mentioned Dutch book arguments. Central to this approach is the work of Savage and De Finetti. Savage (1962) proposed to axiomatize statistics in tandem with decision theory, a mathematical theory about practical rationality. He argued that by themselves the probability assignments do not mean anything at all, and that they can only be interpreted in a context where an agent faces a choice between actions, i.e., a choice among a set of bets. In a similar vein, De Finetti (e.g., 1974) advocated a view on statistics in which only the empirical consequences of the probabilistic beliefs, expressed in a willingness to bet, mattered, although he did not make statistical inference fully dependent on decision theory. While the approaches differ a great deal, it appears that the subjectivist view on Bayesian statistics is based on the same behaviorism and empiricism that motivated Neyman and Pearson to develop classical statistics.

Notice that all this makes one aspect of the interpretation problem of Section 4.2.1 reappear: how will the prior distribution over hypotheses make itself apparent in behavior, so that it can rightfully be interpreted in terms of belief, here understood as a willingness to act? One response to this question is to turn to different motivations for representing degrees of belief by means of probability assignments. Following work by De Finetti, several authors have proposed vindications of probabilistic expressions of belief that are not based on behavioral goals, but rather on the epistemic goal of holding beliefs that accurately represent the world, e.g., Rosenkrantz (1981), Joyce (1998), Leitgeb and Pettigrew (2010a and 2010b), Easwaran (2013). A strong generalization of this idea is achieved in Schervish, Seidenfeld and Kadane (2009), which builds on a longer tradition of using scoring rules for achieving statistical aims like calibration or accurate prediction. An alternative approach is that any formal representation of belief must respect certain logical constraints; e.g., Cox (1961) provides an argument for the expression of belief in terms of probability assignments on the basis of the nature of partial belief per se.

The original subjectivist response to the issue that a prior over hypotheses is hard to interpret came from De Finetti’s so-called representation theorem, which shows that every prior distribution can be associated with its own set of predictions, and hence with its own behavioral consequences. In other words, De Finetti showed how priors are indeed associated with beliefs that can carry a betting interpretation, so that the prior over the hypotheses can be understood in empiricist terms after all.

4.3.2 Excursion: the representation theorem

De Finetti’s representation theorem relates rules for prediction, as functions of the given sample data, to Bayesian statistical analyses of those data, against the background of a statistical model. See Festa (1996) and Suppes (2001) for useful introductions. De Finetti considers a process that generates a series of time-indexed observations, and he then studies prediction rules that take finite initial segments of this series as input and return a probability over future events, using a statistical model that can analyze such samples and provide the predictions. The key result of De Finetti is that a particular statistical model, namely the set of all distributions in which the observations are independently and identically distributed, can be equated with the class of exchangeable prediction rules, namely the rules whose predictions do not depend on the order in which the observations come in.

Let us consider the representation theorem in some more formal detail. For simplicity, say that the process generates time-indexed binary observations, i.e., 0s and 1s. The prediction rules take such bit strings of length \(t\), denoted \(S_{t}\), as input, and return a probability for the event that the next bit in the string is a 1, denoted \(Q^{1}_{t+1}\). So we write the prediction rules as partial probability assignments \(P(Q^{1}_{t+1} \mid S_{t})\). Exchangeable prediction rules are rules that deliver the same prediction independently of the order of the bits in the string \(S_{t}\). If we write the event that the string \(S_{t}\) has a total of \(n\) observations of 1s as \(S_{n/t}\), then exchangeable prediction rules are written as \(P(Q^{1}_{t+1} \mid S_{n/t})\). The crucial property is that the value of the prediction is not affected by the order in which the 0s and 1s show up in the string \(S_{t}\).

De Finetti relates this particular set of exchangeable prediction rules to a Bayesian inference over a specific type of statistical model. The model that De Finetti considers comprises the so-called Bernoulli hypotheses \(h_{\theta}\), i.e., hypotheses for which

\[ P(Q^{1}_{t+1} \mid h_{\theta} \cap S_{t}) = \theta . \]

This likelihood does not depend on the string \(S_{t}\) that has gone before. The hypotheses are best thought of as determining a fixed bias \(\theta\) for the binary process, where \(\theta \in \Theta = [0, 1]\). The representation theorem states that there is a one-to-one mapping between priors over Bernoulli hypotheses and exchangeable prediction rules. That is, every prior distribution \(P(h_{\theta})\) can be associated with exactly one exchangeable prediction rule \(P(Q^{1}_{t+1} \mid S_{n/t})\), and conversely. Next to the original representation theorem derived by De Finetti, several other and more general representation theorems were proved, e.g., for partially exchangeable sequences and hypotheses on Markov processes (Diaconis and Freedman 1980, Skyrms 1991), for clustering predictions and partitioning processes (Kingman 1975 and 1978), and even for sequences of graphs and their generating process (Aldous 1981).
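One direction of this correspondence can be illustrated numerically (a sketch not in the original entry; the Beta-shaped prior over the bias is an arbitrary choice): the prediction rule induced by a prior over Bernoulli hypotheses assigns the same probability to any two strings with the same counts of 0s and 1s, i.e., it is exchangeable.

```python
import numpy as np

# Grid over the bias parameter theta, with an arbitrary non-uniform
# prior density (here proportional to theta * (1 - theta)).
dth = 1e-4
th = np.arange(dth / 2, 1.0, dth)
prior = 6 * th * (1 - th)

def string_prob(bits):
    # Probability of a whole bit string under the Bernoulli mixture,
    # built up bit by bit from the predictive rule P(next = 1 | prefix),
    # with a Bayesian update of the weights after each bit.
    w = prior.copy()
    total = 1.0
    for b in bits:
        pred = (th * w).sum() / w.sum()      # P(next bit = 1 | prefix)
        total *= pred if b == 1 else 1 - pred
        w = w * (th if b == 1 else 1 - th)   # update the weights on the grid
    return total

p1 = string_prob([1, 1, 0, 1, 0])
p2 = string_prob([0, 0, 1, 1, 1])   # same counts of 0s and 1s, new order
```

Both strings contain three 1s and two 0s, so the sequential predictions multiply out to the same probability, whatever the prior.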

Representation theorems equate a prior distribution over statistical hypotheses to a prediction rule, and thus to a probability assignment that can be given a subjective and behavioral interpretation. This removes the worry expressed above, that the prior distribution over hypotheses cannot be interpreted subjectively because it cannot be related to belief as a willingness to act: priors relate uniquely to particular predictions. However, for De Finetti the representation theorem provided a reason for doing away with statistical hypotheses altogether, and hence for the removal of a notion of probability as anything other than subjective opinion (cf. Hintikka 1970): hypotheses whose probabilistic claims could be taken to refer to intangible chancy processes are superfluous metaphysical baggage.

Not all subjectivists are equally dismissive of the use of statistical hypotheses. Jeffrey (1992) has proposed so-called mixed Bayesianism, in which subjectively interpreted distributions over the hypotheses are combined with a physical interpretation of the distributions that hypotheses define over sample space. Romeijn (2003, 2005, 2006) argues that priors over hypotheses are a more efficient and intuitive way of determining inductive predictions than specifying properties of predictive systems directly. This seems in agreement with the practice of science, in which hypotheses are routinely used, and often motivated by mechanistic knowledge of the data generating process.

4.3.3 Bayesian statistics as logic

Despite the arguably subjective character of the prior, there is a sense in which Bayesian statistics might lay claim to objectivity. It can be shown that the Bayesian formalism meets certain objective criteria of rationality, coherence, and calibration. Bayesian statistics thus answers to the requirement of objectivity at a meta-level: while the opinions that it deals with retain a subjective aspect, the way in which it deals with these opinions, in particular the way in which data impact on them, is objectively correct, or so it is argued. Arguments supporting the Bayesian way of accommodating data, namely by conditionalization, have been provided in a pragmatic context by dynamic Dutch book arguments, whereby probability is interpreted as a willingness to bet (cf. Maher 1993, van Fraassen 1989). Similar arguments have been advanced on the grounds that our beliefs must accurately represent the world along the lines of De Finetti (1974), e.g., Greaves and Wallace (2006) and Leitgeb and Pettigrew (2010).

An important distinction must be made in arguments that support the Bayesian way of accommodating evidence, namely between the mathematical fact of Bayes’ theorem, as introduced in Section 4.1.1, and the epistemic principle of Bayes’ rule, which ensures the coherence of belief states over time. The theorem is simply a relation among probability assignments, and as such not subject to debate. Arguments that support the representation of the epistemic state of an agent by means of probability assignments also provide support for Bayes’ theorem as a constraint on degrees of belief. The conditional probability \(P(h \mid s)\) can be interpreted as the degree of belief attached to the hypothesis \(h\) on the supposition that the sample \(s\) is obtained. Bayes’ rule, by contrast, presents a constraint on probability assignments that represent epistemic states of an agent at different points in time. It is written as

\[ P_{s}(h) \;=\; P(h \mid s) , \]

and it determines that the new probability assignment, expressing the epistemic state of the agent after the sample has been obtained, is systematically related to the old assignment, representing the epistemic state before the sample came in. In the philosophy of statistics many Bayesians adopt Bayes’ rule implicitly, and in the philosophical debate over learning Bayes’ rule is a central tenet (cf. Huttegger 2017). But in what follows we will assume that Bayesian statistical inferences rely on Bayes’ theorem only.

Whether the focus lies on Bayes’ rule or on Bayes’ theorem, the common theme in the above-mentioned arguments is that they approach Bayesian statistical inference from a logical angle, and focus on its internal coherence or consistency (cf. Howson 2003). While its use in statistics is undeniably inductive, Bayesian inference thereby obtains a deductive, or at least non-ampliative character: everything that is concluded in the inference is somehow already present in the premises. In Bayesian statistical inference, those premises are given by the prior over the hypotheses, \(P(h_{\theta})\) for \(\theta \in \Theta\), and the likelihood functions, \(P(s \mid h_{\theta})\), as determined for each hypothesis \(h_{\theta}\) separately. These premises fix a single probability assignment over the space \(M \times S\) at the outset of the inference. The conclusions, in turn, are straightforward consequences of this probability assignment. They can be derived by applying theorems of probability theory, most notably Bayes’ theorem. Bayesian statistical inference thus becomes an instance of probabilistic logic (cf. Hailperin 1986, Halpern 2003, Haenni et al. 2011).

Summing up, there are several arguments showing that statistical inference by Bayes’ theorem, or by Bayes’ rule, is objectively correct. These arguments invite us to consider Bayesian statistics as an instance of probabilistic logic. Appeals to the logicality of Bayesian statistical inference may provide a partial remedy for its subjective character. Moreover, a logical approach to the statistical inferences avoids the problem that the formalism places unrealistic demands on the agents, and that it presumes the agent to have certain knowledge. Much like in deductive logic, we need not assume that the inferences are psychologically realistic, nor that the agents actually believe the premises of the arguments. Rather, the arguments present the agents with a normative ideal and take the conditional form of consistency constraints: if you accept the premises, then these conclusions follow.

4.3.4 Excursion: inductive logic and statistics

An important instance of probabilistic logic is presented in inductive logic, as devised by Carnap, Hintikka and others (Carnap 1950 and 1952, Hintikka and Suppes 1966, Carnap and Jeffrey 1970, Kuipers 1978, Hintikka and Niiniluoto 1980, Paris 1994, Skyrms 1999, Nix and Paris 2006, Paris and Waterhouse 2009, Paris and Vencovská 2015). Historically, Carnapian inductive logic developed prior to the probabilistic logics referenced above, and more or less separately from the debates in the philosophy of statistics. But the logical systems of Carnap can quite easily be placed in the context of a logical approach to Bayesian inference, and doing this is in fact quite insightful.

For simplicity, we choose a setting that is similar to the one used in the exposition of the representation theorem, namely a binary data generating process, i.e., strings of 0s and 1s. A prediction rule determines a probability for the event, denoted \(Q^{1}_{t+1}\), that the next bit in the string is a 1, on the basis of a given string of bits with length \(t\), denoted by \(S_{t}\). Carnap and followers designed specific exchangeable prediction rules, mostly variants of the straight rule (Reichenbach 1938),

\[ P(Q^{1}_{t+1} \mid S_{n/t}) = \frac{n + 1}{t + 2} , \]

where \(S_{n/t}\) denotes a string of length \(t\) of which \(n\) entries are 1s. Carnap derived such rules from constraints on the probability assignments over the samples. Some of these constraints boil down to the axioms of probability. Other constraints, exchangeability among them, are independently motivated, by an appeal to a so-called logical interpretation of probability. Under this logical interpretation, the probability assignment must respect certain invariances under transformations of the sample space, in analogy to logical principles that constrain truth valuations over a language in a particular way.
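This particular rule coincides with the predictions of a Bayesian inference under a uniform prior over the Bernoulli bias, a connection made precise by the representation theorem discussed above. A quick numerical check (an illustration not in the original entry):

```python
def carnap_rule(n, t):
    # Variant of the straight rule: probability of a 1 after observing
    # n ones among t bits.
    return (n + 1) / (t + 2)

def bayes_prediction(n, t, grid=10_000):
    # Predictive probability of a 1 under a uniform prior over the bias:
    # the posterior expectation of theta given n ones in t trials,
    # approximated on a midpoint grid over [0, 1].
    num = den = 0.0
    for i in range(grid):
        theta = (i + 0.5) / grid
        w = theta**n * (1 - theta)**(t - n)
        num += theta * w
        den += w
    return num / den

r_rule = carnap_rule(3, 5)          # 4/7
r_bayes = bayes_prediction(3, 5)    # approximately 4/7
```

The agreement holds for any counts \(n\) and \(t\), which is why the rule can be read either as a Carnapian axiom-derived rule or as a Bayesian inference with a preferred prior.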

Carnapian inductive logic is an instance of probabilistic logic, because its sequential predictions are all based on a single probability assignment at the outset, and because it relies on Bayes’ theorem to adapt the predictions to sample data (cf. Romeijn 2011). One important difference with Bayesian statistical inference is that, for Carnap, the probability assignment specified at the outset only ranges over samples and not over hypotheses. However, by De Finetti’s representation theorem Carnap’s exchangeable rules can be equated to particular Bayesian statistical inferences. A further difference is that Carnapian inductive logic gives preferred status to particular exchangeable rules. In view of De Finetti’s representation theorem, this comes down to the choice of a particular set of preferred priors. As further developed below, Carnapian inductive logic is thus related to objective Bayesian statistics. It is a moot point whether further constraints on the probability assignments can be considered as logical, as Carnap and followers have it, or whether the title of logic is best reserved for the probability formalism in isolation, as De Finetti and followers argue.

4.3.5 Objective priors

A further set of responses to the subjectivity of Bayesian statistical inference targets the prior distribution directly: we might provide further rationality principles by which the prior can be chosen objectively. The literature proposes several objective criteria for filling in the prior over the model. Each of these lays claim to being the correct expression of complete ignorance concerning the value of the model parameters, or of minimal information regarding the parameters. Three such criteria are discussed here.

In the context of Bertrand’s paradox we already discussed the principle of indifference, according to which probability should be distributed evenly over the available possibilities. A further development of this idea is presented by the requirement that a distribution should have maximum entropy. Notably, the use of entropy maximization for determining degrees of belief finds much broader application than only in statistics: similar ideas are taken up in diverse fields like epistemology (e.g., Shore and Johnson 1980, Williams 1980, Uffink 1996, and also Williamson 2010), inductive logic (Paris and Vencovska 1989), statistical mechanics (Jaynes 2003) and decision theory (Seidenfeld 1986, Grunwald and Halpern 2004). In objective Bayesian statistics, the idea is applied to the prior distribution over the model (cf. Berger 2006). For a finite number of hypotheses the entropy of the distribution \(P(h_{\theta})\) is defined as

\[ E[P] \;=\; -\sum_{\theta \in \Theta} P(h_{\theta}) \log P(h_{\theta}) . \]

Maximizing this entropy leads to equiprobable hypotheses. However, for continuous models the maximum entropy distribution depends crucially on the metric over the parameters in the model. The burden of subjectivity is thereby moved to the parameterization. But it may well be that we have strong reasons for preferring a particular parameterization over others (cf. Jaynes 1973).
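That the uniform distribution maximizes this entropy can be checked numerically (a sketch not in the original entry; the four hypotheses and the random search are arbitrary illustrative choices): no randomly drawn distribution over the hypotheses attains a higher entropy than the equiprobable one.

```python
import numpy as np

def entropy(p):
    # E[P] = -sum p log p, with the convention 0 log 0 = 0.
    p = p[p > 0]
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(2)
k = 4                                    # four hypotheses in the model
uniform = np.full(k, 1 / k)
h_uniform = entropy(uniform)             # equals log(4)

# Compare against 10,000 randomly drawn probability distributions.
draws = rng.dirichlet(np.ones(k), 10_000)
h_max_random = max(entropy(p) for p in draws)
```

The uniform distribution attains the theoretical maximum \(\log k\); every random competitor falls below it.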

There are other approaches to the objective determination of priors. In view of the above problems, a particularly attractive method for choosing a prior over a continuous model is proposed by Jeffreys (1961). The general idea of so-called Jeffreys priors is that the prior probability assigned to a small patch in the parameter space is proportional to what may be called the density of the distributions within that patch. Intuitively, if a lot of distributions, i.e., distributions that differ among themselves, are packed together on a small patch in the parameter space as measured by the base metric, this patch should be given a larger prior probability than a similar patch within which there is little variation among the distributions (cf. Balasubramanian 2005). More technically, such a density is expressed by a prior distribution that is proportional to the square root of the Fisher information. A key advantage of these priors is that they are invariant under reparameterizations of the parameter space: a new parameterization naturally leads to an adjusted density of distributions.
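For a Bernoulli model this recipe can be made concrete (an illustrative sketch, not from the text): the Fisher information of a single trial is \(1/(\theta(1-\theta))\), so the Jeffreys prior is proportional to \(\theta^{-1/2}(1-\theta)^{-1/2}\), whose normalizing constant is the Beta function \(B(1/2, 1/2) = \pi\).

```python
import numpy as np

def fisher_bernoulli(theta):
    # Fisher information of one Bernoulli trial, computed as the expected
    # squared score E[(d/dtheta log P(x | theta))^2]; equals 1/(theta(1-theta)).
    score1 = 1 / theta             # score of the outcome x = 1
    score0 = -1 / (1 - theta)      # score of the outcome x = 0
    return theta * score1**2 + (1 - theta) * score0**2

# Jeffreys prior: proportional to the square root of the Fisher information.
dth = 1e-4
th = np.arange(dth / 2, 1.0, dth)
jeffreys = np.sqrt(fisher_bernoulli(th))

# The normalizing constant approximates B(1/2, 1/2) = pi.
norm = (jeffreys * dth).sum()
```

Note that this prior piles up mass near \(\theta = 0\) and \(\theta = 1\): in the Bernoulli case, the distributions there differ more sharply among themselves than those in the middle of the parameter range.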

A final method of defining priors goes under the name of reference priors (Berger et al 2009). The proposal starts from the observation that we should minimize the subjectivity of the results of our statistical analysis, and hence that we should minimize the impact of the prior probability on the posterior. The idea of reference priors is exactly that they allow the sample data a maximal say in the posterior distribution. But since at the outset we do not know what sample we will obtain, the prior is chosen so as to maximize the expected impact of the data. Of course this expectation must itself be taken with respect to some distribution over sample space, and some associated measure of impact.

4.3.6 Generalizing or revising priors

A different response to the subjectivity of priors is to extend the Bayesian formalism, in order to leave the choice of prior to some extent open. The subjective choice of a prior is in that case generalized away, or made subject to alteration. Three such responses will be considered.

Recall that a prior probability distribution over statistical hypotheses expresses our uncertain opinion on which of the hypotheses is right. The central idea behind hierarchical Bayesian models (Gelman et al 2013) is that the same pattern of putting a prior over statistical hypotheses can be repeated on the level of priors itself. More precisely, we may be uncertain over which prior probability distribution over the hypotheses is right. If we characterize possible priors by means of a set of parameters, we can express this uncertainty about prior choice in a probability distribution over the parameters that characterize the shape of the prior. In other words, we move our uncertainty one level up in a hierarchy: we consider multiple priors over the statistical hypotheses, and compare the performance of these priors on the sample data as if the priors were themselves hypotheses.

The idea of hierarchical Bayesian modeling (Gelman et al 2013) relates naturally to the Bayesian comparison of Carnapian prediction rules (e.g., Skyrms 1993 and 1996, Festa 1996), and also to the estimation of optimum inductive methods (Kuipers 1986, Festa 1993). Hierarchical Bayesian modeling can also be related to another tool for choosing a particular prior distribution over hypotheses, namely the method of empirical Bayes, which estimates the prior that leads to the maximal marginal likelihood of the model. In the philosophy of science, hierarchical Bayesian modeling made its first appearance in Henderson et al. (2010).

There is also a response that avoids the choice of a prior altogether. This response starts with the same idea as hierarchical models: rather than considering a single prior over the hypotheses in the model, we consider a parameterized set of them. But instead of defining a distribution over this set, proponents of interval-valued or imprecise probability claim that our epistemic state regarding the priors is better expressed by this set of distributions, and that sharp probability assignments must therefore be replaced by lower and upper bounds to the assignments. Now the idea that uncertain opinion is best captured by a set of probability assignments, or a credal set for short, has a long history and is backed by an extensive literature (e.g., De Finetti 1974, Levi 1980, Dempster 1967 and 1968, Shafer 1976, Walley 1991, Augustin et al 2014). In light of the main debate in the philosophy of statistics, the use of interval-valued priors indeed forms an attractive extension of Bayesian statistics: it allows us to refrain from choosing a specific prior, and thereby presents a rapprochement to the classical view on statistics.

These theoretical developments are attractive, but sadly they mostly enjoy a cult status among philosophers of statistics and have not moved the applied statistician in the street. On the other hand, standard Bayesian statistics has seen a steep rise in popularity over the past decade or so, owing to the availability of good software and numerical approximation methods. And most of the practical use of Bayesian statistics is more or less insensitive to the potentially subjective aspects of the statistical results, employing uniform priors as a neutral starting point for the analysis and relying on the aforementioned convergence results to wash out the remaining subjectivity (cf. Gelman and Shalizi 2013). However, this practical attitude of scientists towards modelling should not be mistaken for a principled answer to the questions raised in the philosophy of statistics (see for example Morey et al 2013).

A final response adopts the more Popperian viewpoint about statistical modelling expounded in Gelman and Shalizi (2013), and makes the choice of prior explicitly subject to revision. Hierarchical Bayesian modelling goes some way towards this idea, by allowing us to consider a collection of priors and adapt it on the basis of our observations. This idea can be expanded to include the revision of the whole model, i.e., the collection of statistical hypotheses over which the prior is defined (cf. Romeijn 2005). Revisions of this kind relate to changes in the language, or the conceptual scheme, by means of which we are learning from the observations (Gillies 2001, Williamson 2005). Such revisions go beyond the usual method of Bayesian updating but they can still be regimented by rationality constraints, for example by requiring approximate coherence or conservativity in some way (Morey et al. 2013, Wenmackers and Romeijn 2015). Further rationality requirements pertain to what might motivate model revisions: predictive underperformance, changes in salience or awareness, and so on. It is to such model evaluations that we now turn.

5. Statistical models

In the foregoing we have seen how classical and Bayesian statistics differ. But the two major approaches to statistics also have a lot in common. Most importantly, all statistical procedures rely on the assumption of a statistical model, here referring to any restricted set of statistical hypotheses. Moreover, they are all aimed at delivering something of a verdict over these hypotheses. For example, a classical likelihood ratio test considers two hypotheses, \(h\) and \(h'\), and then offers a verdict of rejection or acceptance, while a Bayesian comparison delivers a posterior probability over these two hypotheses. Whereas in Bayesian statistics the model presents a very strong assumption, classical statistics does not endow the model with a special epistemic status: the hypotheses are simply those currently entertained by the scientist. Still, the adoption of a model is absolutely central to any statistical procedure.

A natural question is whether anything can be said about the quality of the statistical model, and whether any verdict on this starting point for statistical procedures can be given. While it is hard to determine the truth or falsity of a model, some models will lead to better predictions, or be a better guide to the truth, than others, inviting the slogan that &ldquo;all models are wrong but some are useful&rdquo; (cf. Wit et al. 2011). The evaluation of models touches on deep issues in the philosophy of science, because the statistical model often determines how the data-generating system under investigation is conceptualized and approached (Kieseppa 2001). Model choice thus resembles the choice of a theory, a conceptual scheme, or even of a whole paradigm, and thereby might seem to transcend the formal frameworks for studying theoretical rationality (cf. Carnap 1950, Jeffrey 1980). Despite the fact that some considerations on model choice will seem extra-statistical, in the sense that they fall outside the scope of statistical treatment, statistics offers several methods for approaching the choice of statistical models.

5.1 Model comparisons

There are in fact very many methods for evaluating statistical models (Claeskens and Hjort 2008, Wagenmakers and Waldorp 2006). In the first instance, the methods occasion the comparison of statistical models, but very often they are used for selecting one model over the others. In what follows we only review prominent techniques that have led to philosophical debate: Akaike&rsquo;s information criterion, the Bayesian information criterion, and furthermore the computation of marginal likelihoods and posterior model probabilities, both associated with Bayesian model selection. We leave aside methods that use cross-validation as they have, unduly, not received as much attention in the philosophical literature. The connection of model selection to conceptions of simplicity is briefly considered.

5.1.1 Akaike’s information criterion

Akaike&rsquo;s information criterion, modestly termed An Information Criterion or AIC for short, is based on the classical statistical procedure of estimation (see Burnham and Anderson 2002, Kieseppa 1997). It starts from the idea that a model \(M\) can be judged by the estimate \(\hat{\theta}\) that it delivers, and more specifically by the proximity of this estimate to the distribution with which the data are actually generated, i.e., the true distribution. This proximity is often equated with the expected predictive accuracy of the estimate, because if the estimate and the true distribution are closer to each other, their predictions will be better aligned to one another as well. In the derivation of the AIC, the so-called relative entropy or Kullback-Leibler divergence of the two distributions is used as a measure of their proximity, and hence as a measure of the expected predictive accuracy of the estimate.

Naturally, the true distribution is not known to the statistician who is evaluating the model. If it were, then the whole statistical analysis would be useless. However, it turns out that we can give an unbiased estimate of the divergence between the true distribution and the distribution estimated from a particular model,

\[ \text{AIC}[M] = - 2 \log P( s \mid h_{\hat{\theta}(s)} ) + 2 d , \]

in which \(s\) is the sample data, \(\hat{\theta}(s)\) is the maximum likelihood estimate (MLE) of the model \(M\), and \(d = \dim(\Theta)\) is the number of dimensions of the parameter space of the model. The MLE of the model thereby features in an expression of the model quality, i.e., in a role that is conceptually distinct from the estimator function.

As can be seen from the expression above, a model with a smaller AIC is preferable: we want the fit to be optimal at little cost in complexity. Notice that the number of dimensions, or independent parameters, in the model increases the AIC and thereby lowers the eligibility of the model: if two models achieve the same maximum likelihood for the sample, then the model with fewer parameters will be preferred. For this reason, statistical model selection by the AIC can be seen as an independent motivation for preferring simple models over more complex ones (Sober and Forster 1994). But this result also invites some critical remarks. For one, we might impose other criteria than unbiasedness on the estimation of the proximity to the truth, and this will lead to different expressions for the approximation. Moreover, it is not always clear-cut what the dimensions of the model under scrutiny really are. For curve fitting this may seem simple, but for more complicated models or different conceptualizations of the space of models, things do not look so easy (cf. Myung et al 2001, Kieseppa 2001, Romeijn 2017). The complexity of statistical models and its relation to learning have received more systematic attention in statistical learning theory (Vapnik 2000, Harman and Kulkarni 2007) and there are extensive discussions connecting it up to a broader philosophical notion of simplicity (e.g., Sober 2004).

A primary example of model selection is presented in curve fitting. Given a sample \(s\) consisting of a set of points in the plane \((x, y)\), we are asked to choose the curve that fits these data best. We assume that the models under consideration are of the form \(y = f(x) + \epsilon\), where \(\epsilon\) is a normal distribution with mean 0 and a fixed standard deviation, and where \(f\) is a polynomial function. Different models are characterized by polynomials of different degrees that have different numbers of parameters. Estimations fix the parameters of these polynomials. For example, for the 0-degree polynomial \(f(x) = c_{0}\) we estimate the constant \(\hat{c_{0}}\) for which the probability of the data is maximal, and for the 1-degree polynomial \(f(x) = c_{0} + c_{1}\, x\) we estimate the slope \(\hat{c_{1}}\) and the offset \(\hat{c_{0}}\). Now notice that for a total of \(n+1\) points, we can always find a polynomial of degree \(n\) that intersects with all points exactly, resulting in a comparatively high maximum likelihood \(P(s \mid \{\hat{c_{0}}, \ldots \hat{c_{n}} \})\). Applying the AIC, however, we will typically find that some model with a polynomial of degree \(k &lt; n\) is preferable. Although \(P(s \mid \{\hat{c_{0}}, \ldots \hat{c_{k}} \})\) will be somewhat lower, this is compensated for in the AIC by the smaller number of parameters.
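The comparison can be sketched in a few lines of Python (the data-generating polynomial, the noise level, and the range of degrees are invented for the illustration; the standard deviation \(\sigma\) is treated as known, so each model's free parameters are just its polynomial coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: data from a degree-2 polynomial with Gaussian noise.
sigma = 0.2                                    # known, fixed standard deviation
x = np.linspace(0, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, sigma, x.size)

def aic(degree):
    coeffs = np.polyfit(x, y, degree)          # least squares = MLE for Gaussian noise
    residuals = y - np.polyval(coeffs, x)
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                     - residuals**2 / (2 * sigma**2))
    d = degree + 1                             # number of free parameters
    return -2 * log_lik + 2 * d

scores = {k: aic(k) for k in range(6)}
best = min(scores, key=scores.get)
# Degrees above the true one raise the likelihood only slightly but pay
# the complexity penalty 2d; degrees below it fit poorly.
```

Degrees 0 and 1 score badly because they miss the curvature in the data, while the higher-degree polynomials gain little likelihood relative to the penalty they incur.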

5.1.2 Bayesian model evaluation and beyond

Various other prominent model selection tools are based on methods from Bayesian statistics. They all start from the idea that the quality of a model is expressed in the performance of the model on the sample data: the model that, on the whole, makes the sampled data most probable is to be preferred. Because of this, there is a close connection with the hierarchical Bayesian modelling referred to earlier (Gelman 2013). The central notion in the Bayesian model selection tools is thus the marginal likelihood of the model, i.e., the weighted average of the likelihoods over the model, using the prior distribution as a weighting function:

\[ P(s \mid M_{i}) \; = \; \int_{\theta \in \Theta_{i}} P(h_{\theta}) P(s \mid h_{\theta}) d\theta . \]

Here \(\Theta_{i}\) is the parameter space belonging to model \(M_{i}\). The marginal likelihoods can be combined with a prior probability over models, \(P(M_{i})\), to derive the so-called posterior model probability, using Bayes&rsquo; theorem. One way of evaluating models, known as Bayesian model selection, is by comparing the models on their marginal likelihood, or else on their posteriors (cf. Kass and Raftery 1995).
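A toy computation may help (the data and the two models are invented for the sketch): compare a model \(M_1\) containing only the fair-coin hypothesis \(\theta = 1/2\) with a model \(M_2\) in which \(\theta\) ranges over \([0, 1]\) under a uniform prior.

```python
import numpy as np
from math import factorial

heads, n = 7, 10  # made-up data: 7 heads in 10 tosses, in a fixed order

def likelihood(theta):
    return theta**heads * (1 - theta)**(n - heads)

# Model M1: only the fair-coin hypothesis theta = 1/2.
marginal_m1 = likelihood(0.5)

# Model M2: theta in [0, 1] with a uniform prior as weighting function;
# the marginal likelihood is the average likelihood over the model,
# computed here by the trapezoid rule on a grid.
theta = np.linspace(0.0, 1.0, 10001)
vals = likelihood(theta)
marginal_m2 = np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(theta))

# Closed form for the uniform prior: the Beta function
# B(heads+1, n-heads+1) = heads! (n-heads)! / (n+1)!.
exact = factorial(heads) * factorial(n - heads) / factorial(n + 1)

bayes_factor = marginal_m2 / marginal_m1
```

With these numbers the Bayes factor \(P(s \mid M_2)/P(s \mid M_1)\) comes out just below 1: although \(M_2\) contains the better-fitting hypothesis \(\theta = 0.7\), its uniform prior spreads weight over many poorly fitting hypotheses.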

Usually the marginal likelihood cannot be computed analytically. Numerical approximations can often be obtained, but for practical purposes it has proved very useful, and quite sufficient, to employ an approximation of the marginal likelihood. This approximation has become known as the Bayesian information criterion, or BIC for short (Schwarz 1978, Raftery 1995). It turns out that this approximation shows remarkable similarities to the AIC:

\[ \text{BIC}[M] \; = \; - 2 \log P(s \mid h_{\hat{\theta}(s)}) + d \log n . \]

Here \(\hat{\theta}(s)\) is again the maximum likelihood estimate of the model, \(d = \dim(M)\) the number of independent parameters, and \(n\) is the number of data points in the sample. The latter dependence is the only difference from the AIC, but it is a major difference in how the model evaluation may turn out.
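The effect of the \(\log n\) term can be made vivid with invented numbers: suppose an extra parameter improves the fit term \(-2 \log P(s \mid h_{\hat{\theta}(s)})\) by 5. The AIC charges 2 per parameter and so always prefers the larger model, while the BIC charges \(\log n\) and withdraws that preference once \(n\) exceeds \(e^{5} \approx 148\):

```python
import numpy as np

fit_gain = 5.0  # invented improvement in -2 log-likelihood per extra parameter

def aic_penalty(n):
    return 2.0            # AIC: constant charge per parameter

def bic_penalty(n):
    return np.log(n)      # BIC: charge grows with the sample size

def prefers_larger_model(n, penalty):
    return fit_gain > penalty(n)

for n in (50, 148, 149, 1000):
    print(n, prefers_larger_model(n, aic_penalty), prefers_larger_model(n, bic_penalty))
```

So with a growing sample, the BIC increasingly demands that extra parameters earn their keep, which is one concrete sense in which the two criteria "may turn out" differently.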

The concurrence of the AIC and the BIC seems to give a further motivation for our intuitive preference for simple models over more complex ones. Indeed, other model selection tools, like the deviance information criterion (Spiegelhalter et al 2002) and the approach based on minimum description length (Grunwald 2007), also result in expressions that feature a term that penalizes complex models. However, as intimated earlier, the dimension term that we know from the information criteria does not exhaust the notion of model complexity. There is ongoing debate in the philosophy of science concerning the merits of model selection in explications of the notion of simplicity, informativeness, and the like (see, for example, Romeijn and van de Schoot 2008, Romeijn et al 2012, Steele and Werndl 2013, Sprenger 2013, Sober 2015, Autzen 2016). Besides the debate over the evaluation of and choice between statistical models, there is some philosophical interest in statistical meta-analysis, i.e., the practice of combining or aggregating models (e.g., Stegenga 2011). Considering the importance of meta-analyses as a means to integrate and systematize the ever-growing amount of research findings, this is an area that deserves more philosophical scrutiny.

An interesting new development in philosophical discussions on induction is the use of so-called meta-induction (cf. Cesa-Bianchi and Lugosi 2006, Schurz 2019, Sterkenburg 2020). Its basic idea relates to ensemble methods and statistical model evaluation: rather than a single model and concomitant predictive system, we consider a collection of predictors and deploy them according to their past performance, e.g., by making predictions based on a performance-weighted average of the models. There are obvious connections between meta-induction, model evaluation and statistical meta-analysis that merit further philosophical attention.
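A standard implementation of this idea is the exponentially weighted average forecaster studied in Cesa-Bianchi and Lugosi (2006). The sketch below uses invented predictors and data: each round, predictors are weighted in proportion to \(\exp(-\eta \times \text{cumulative loss})\), and the forecast is their weighted average.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented setting: a binary sequence with P(1) = 0.8, and three
# constant "predictors" that always forecast 0.2, 0.5, and 0.8.
outcomes = rng.binomial(1, 0.8, size=500)
predictors = np.array([0.2, 0.5, 0.8])
eta = 0.5                                   # learning rate
cum_loss = np.zeros(len(predictors))
meta_loss = 0.0

for y in outcomes:
    weights = np.exp(-eta * cum_loss)
    weights /= weights.sum()
    forecast = weights @ predictors         # performance-weighted average
    meta_loss += (forecast - y) ** 2
    cum_loss += (predictors - y) ** 2       # squared loss per predictor

best_loss = cum_loss.min()
# The meta-forecaster's cumulative loss stays within a small additive
# "regret" of the best predictor in hindsight.
```

The philosophical point is that such guarantees hold for any sequence of outcomes: the meta-inductive forecaster does nearly as well as the best predictor in its pool without knowing in advance which one that is.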

The general idea of using a wide range of models can also be identified in a rather new field on the intersection of Bayesian statistics and machine learning, Bayesian nonparametrics (e.g., Orbanz and Teh 2010, Hjort et al 2010). Rather than specifying, at the outset, a confined set of distributions from which a statistical analysis is supposed to choose on the basis of the data, the idea is that the data are confronted with a potentially infinite-dimensional space of possible distributions. The set of distributions taken into consideration is then made relative to the data obtained: the complexity of the model grows with the sample. The result is a predictive system that performs an online model selection alongside a Bayesian accommodation of the posterior over the model.

5.2 Statistics without models

There are also statistical methods that refrain from the use of a particular model, by focusing exclusively on the data and deploying their structure, or by generalizing away from a specific choice of model. Some of these techniques are properly localized in descriptive statistics: they do not concern an inference from data but merely serve to describe the data in a particular way; principal component analysis is a primary example of this. Other techniques do derive inductive conclusions from data without explicitly adopting a model, either by relying on other assumptions that pertain to the application domain of the techniques, or by adopting constraints in a different way. Statistical methods that do not rely on an explicit model choice have unfortunately not attracted much attention in the philosophy of statistics, while data-driven methods enjoy increased popularity in scientific research. We will briefly discuss some of these methods here.

5.2.1 Data reduction techniques

One set of methods, and a quite important one for many practicing statisticians, is aimed at data reduction. Often the sample data are very rich, e.g., consisting of a set of points in a space of very many dimensions. The first step in a statistical analysis may then be to automatically cluster or label the data, or to pick out the salient variability in the data, in order to scale down the computational burden of the analysis itself.

The technique of principal component analysis (PCA) is designed for the latter purpose (Jolliffe 2002). Given a set of points in a space, it seeks out the set of vectors along which the variation in the points is large. As an example, consider two points in a plane parameterized as \((x, y)\): the points \((0, 0)\) and \((1, 1)\). In the \(x\)-direction and in the \(y\)-direction the variation is \(1\), but over the diagonal the variation is maximal, namely \(\sqrt{2}\). The vector on the diagonal is called the principal component of the data. In richer data structures, and using a more general measure of variation among points, we can find the first component in a similar way. Moreover, we can repeat the procedure after subtracting the variation along the last found component, by projecting the data onto the plane perpendicular to that component. This allows us to build up a set of principal components of diminishing importance.
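The example can be verified with a few lines of linear algebra (a sketch; the components are obtained here from the singular value decomposition of the centered data):

```python
import numpy as np

# The two-point example from the text: (0, 0) and (1, 1).
data = np.array([[0.0, 0.0],
                 [1.0, 1.0]])
centered = data - data.mean(axis=0)

# The principal components are the right singular vectors of the
# centered data (equivalently, eigenvectors of its covariance matrix).
_, _, vt = np.linalg.svd(centered, full_matrices=False)
first_component = vt[0]

# Up to sign, the first component is the unit diagonal (1, 1)/sqrt(2),
# and the spread of the projected points along it is sqrt(2).
diagonal = np.array([1.0, 1.0]) / np.sqrt(2)
projections = centered @ first_component
spread = projections.max() - projections.min()
```

Repeating the procedure on the residual (the data minus its projection on the first component) would yield the perpendicular direction as the second, less important component.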

PCA is only one item from a large collection of techniques that are aimed at keeping the data manageable and finding patterns in it, a collection that also includes kernel methods and support vector machines (e.g., Vapnik and Kotz 2006). For present purposes, it is important to stress that such tools should not be confused with statistical analysis: they do not involve the testing or evaluation of distributions over sample space, even though they build up and evaluate models of the data. This sets them apart from, e.g., confirmatory and exploratory factor analysis (Bartholomew 2008), which are sometimes taken to be close relatives of PCA because both sets of techniques allow us to identify salient dimensions within sample space, along which the data show large variation.

Practicing statisticians often employ data reduction tools to arrive at conclusions on the distributions from which the data were sampled. There is already a wide use for machine learning and data mining techniques in the sciences, and we may expect even more usage of these techniques in the future, because so much data is now becoming available for scientific analysis. Moreover, recent statistical research suggests that data reduction, specifically kernel methods, is at the heart of modern machine learning methods, especially deep learning neural networks (cf. Belkin 2021, Bartlett et al. 2021). However, in the philosophy of statistics there is as yet relatively little debate over the epistemic status of conclusions reached by means of these techniques. Philosophers of statistics would do well to direct some attention here.

5.2.2 Formal and statistical learning theory

Different approaches to induction, adjacent to and partly overlapping with statistics, are presented by formal and statistical learning. This is again a vast area of research, located between statistics, computer science and artificial intelligence. The approaches are only briefly mentioned here, as examples of how we can achieve certain statistical aims, namely the identification of stable patterns in data, while in some sense avoiding the choice of a statistical model. We leave aside how formal and statistical learning can be implemented in a computer or in some other cognitive architecture. Instead we focus, necessarily briefly, on the theory of learning algorithms.

Pioneering work on formal and statistical learning was done by Solomonoff (1964). The setting is that of inductive logic, with the data consisting of strings of 0s and 1s, and a predictor who attempts to identify the pattern in these data. So, for example, the data may be a string of the form \(0101010101\ldots\), and the challenge is to identify this string as an alternating sequence. The central idea of Solomonoff is that, to achieve universal induction, all possible computable patterns must be considered. Solomonoff then proceeded to define a formal system in which indeed all patterns are taken into consideration, effectively using a Bayesian analysis with a cleverly constructed but non-computable prior over all computable hypotheses. A comprehensive discussion of universal prediction methods following Solomonoff&rsquo;s idea is offered in Sterkenburg (2018).

In theoretical computer science, the analysis of probabilistic prediction methods developed into statistical learning theory (Vapnik 2000). With the advance of machine learning as a supplement to, or even a replacement of, statistical methods, philosophical attention for this theory has recently increased (e.g., Herrmann 2020, Sterkenburg and Grünwald 2021). A closely related research area in theoretical computer science and philosophy that holds strong ties to the philosophy of statistics is formal learning theory (e.g., Kelly 1996, Kelly et al 1997). The approach originally covered computable learning of non-statistical patterns and formal languages (e.g., Putnam 1963, Gold 1967). More recently, formal learning theory has been extended to hypotheses of any kind, including the learning of statistical and causal models (cf. Genin and Kelly 2017). What is ultimately distinctive about the learning theoretic approach is that the principal focus is on finding the truth. Other considerations, such as probabilistic coherence, are either derived from optimal learning performance (Kelly 2007) or are viewed as secondary constraints that may even hinder the performance of computationally bounded agents (Osherson et al. 1988).

6. Selected related topics

There are numerous topics in the philosophy of science that bear direct relevance to the themes covered in this lemma. A few central topics are mentioned here to direct the reader to related lemmas in the encyclopedia.

One very important topic that is immediately adjacent to the philosophy of statistics is confirmation theory, the philosophical theory that describes and justifies relations between scientific theory and empirical evidence. Arguably, the theory of statistics is a proper part of confirmation theory, as it describes and justifies the relation that obtains between statistical theory and evidence in the form of samples. It can be insightful to place statistical procedures in this wider framework of relations between evidence and theory. Zooming out further, the philosophy of statistics is part of the philosophical topic of methodology, i.e., the general theory on whether and how science acquires knowledge. Thus conceived, statistics is one component in a large collection of scientific methods comprising concept formation, experimental design, manipulation and observation, confirmation, revision, and theorizing.

While these topics have always been important within epistemology and the philosophy of science, developments in the sciences have triggered interest in them from a wider community of scientists, science policy makers, and the general public: the so-called replication crisis, i.e., the discovery that many findings from the social and medical sciences cannot be recovered in repetitions of the original research. Some have pointed to problems with the statistical analyses in these sciences, e.g., the use of error probabilities as measures of how well the findings are established, as a possible explanation for the crisis (e.g., Ioannidis 2005), and this in turn has sparked broad interest in the philosophy of statistics.

There are also a fair number of specific topics from the philosophy of science that are spelled out in terms of statistics or that are located in close proximity to it. One of these topics is the process of measurement, in particular the measurement of latent variables on the basis of statistical facts about manifest variables. The so-called representational theory of measurement (Krantz et al. 1971) relies on statistics, in particular on factor analysis, to provide a conceptual clarification of how mathematical structures represent empirical phenomena. Another important topic from the philosophy of science is causation (see the entries on probabilistic causation and Reichenbach&rsquo;s common cause principle). Philosophers have employed probability theory to capture causal relations ever since Reichenbach (1956), but more recent work in causality and statistics (e.g., Spirtes et al 2001) has given the theory of probabilistic causality an enormous impulse. Here again, statistics provides a basis for the conceptual analysis of causal relations.

And there is so much more. Several specific statistical techniques, like factor analysis and the theory of Bayesian networks, invite conceptual discussion of their own accord. Numerous topics within the philosophy of science lend themselves to statistical elucidation, e.g., the coherence, informativeness, and surprise of evidence. And in turn there is a wide range of discussions in the philosophy of science that inform a proper understanding of statistics. Among them are debates over experimentation and intervention, concepts of chance, the nature of scientific models, and theoretical terms. The reader is invited to consult the entries on these topics to find further indications of how they relate to the philosophy of statistics.

Bibliography

  • Aldous, D.J., 1981, “Representations for Partially Exchangeable Arrays of Random Variables”, Journal of Multivariate Analysis, 11: 581–598.
  • Armendt, B., 1993, “Dutch Books, Additivity, and Utility Theory”, Philosophical Topics, 21: 1–20.
  • Auxier, R.E., and L.E. Hahn (eds.), 2006, The Philosophy of Jaakko Hintikka, Chicago: Open Court.
  • Balasubramanian, V., 2005, “MDL, Bayesian Inference, and the Geometry of the Space of Probability Distributions”, in: Advances in Minimum Description Length: Theory and Applications, P.J. Grunwald et al (eds.), Boston: MIT Press, 81–99.
  • Bandyopadhyay, P., and Forster, M. (eds.), 2011, Handbook for the Philosophy of Science: Philosophy of Statistics, Elsevier.
  • Barnett, V., 1999, Comparative Statistical Inference, Wiley Series in Probability and Statistics, New York: Wiley.
  • Bartholomew, D.J., F. Steele, J. Galbraith, I. Moustaki, 2008, Analysis of Multivariate Social Science Data, Statistics in the Social and Behavioral Sciences Series, London: Taylor and Francis, 2nd edition.
  • Bartlett, P.L., A. Montanari, A. Rakhlin, 2021, “Deep Learning: a Statistical Viewpoint”, Acta Numerica, 30: 87–201.
  • Belkin, M., 2021, “Fit without Fear: Remarkable Mathematical Phenomena of Deep Learning through the Prism of Interpolation”, Acta Numerica, 30: 203–248.
  • Berger, J., 2006, “The Case for Objective Bayesian Analysis”, Bayesian Analysis, 1(3): 385–402.
  • Berger, J.O., J.M. Bernardo, and D. Sun, 2009, “The Formal Definition of Reference Priors”, Annals of Statistics, 37(2): 905–938.
  • Berger, J.O., and R.L. Wolpert, 1984, The Likelihood Principle, Hayward (CA): Institute of Mathematical Statistics.
  • Berger, J.O. and T. Sellke, 1987, “Testing a Point Null Hypothesis: The Irreconciliability of P-values and Evidence”, Journal of the American Statistical Association, 82: 112–139.
  • Bernardo, J.M. and A.F.M. Smith, 1994, Bayesian Theory, New York: John Wiley.
  • Bigelow, J.C., 1977, “Semantics of Probability”, Synthese, 36(4): 459–72.
  • Billingsley, P., 1995, Probability and Measure, Wiley Series in Probability and Statistics, New York: Wiley, 3rd edition.
  • Birnbaum, A., 1962, “On the Foundations of Statistical Inference”, Journal of the American Statistical Association, 57: 269–306.
  • Blackwell, D. and L. Dubins, 1962, “Merging of Opinions with Increasing Information”, Annals of Mathematical Statistics, 33(3): 882–886.
  • Boole, G., 1854, An Investigation of The Laws of Thought on Which are Founded the Mathematical Theories of Logic and Probabilities, London: Macmillan, reprinted 1958, London: Dover.
  • Burnham, K.P. and D.R. Anderson, 2002, Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, New York: Springer, 2nd edition.
  • Carnap, R., 1950, Logical Foundations of Probability, Chicago: The University of Chicago Press.
  • –––, 1952, The Continuum of Inductive Methods, Chicago: University of Chicago Press.
  • Carnap, R. and Jeffrey, R.C. (eds.), 1970, Studies in Inductive Logic and Probability, Volume I, Berkeley: University of California Press.
  • Casella, G., and R.L. Berger, 1987, “Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem”, Journal of the American Statistical Association, 82: 106–111.
  • Cesa-Bianchi, N. and G. Lugosi, 2006, Prediction, Learning and Games, Cambridge: Cambridge University Press.
  • Claeskens, G. and N.L. Hjort, 2008, Model Selection and Model Averaging, Cambridge: Cambridge University Press.
  • Cohen, J., 1994, “The Earth is Round (p < .05)”, American Psychologist, 49: 997–1001.
  • Cox, R.T., 1961, The Algebra of Probable Inference, Baltimore: Johns Hopkins University Press.
  • Cumming, G., 2012, Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, New York: Routledge.
  • Dawid, A.P., 1982, “The Well-Calibrated Bayesian”, Journal of the American Statistical Association, 77(379): 605–610.
  • –––, 2004, “Probability, Causality and the Empirical World: A Bayes-de Finetti-Popper-Borel Synthesis”, Statistical Science, 19: 44–57.
  • Dawid, A.P. and P. Grunwald, 2004, “Game Theory, Maximum Entropy, Minimum Discrepancy, and Robust Bayesian Decision Theory”, Annals of Statistics, 32: 1367–1433.
  • Dawid, A.P. and M. Stone, 1982, “The Functional-Model Basis of Fiducial Inference”, Annals of Statistics, 10: 1054–1067.
  • De Finetti, B., 1937, “La Prévision: ses Lois Logiques, ses Sources Subjectives”, Annales de l’Institut Henri Poincaré, reprinted as “Foresight: its Logical Laws, its Subjective Sources”, in: Kyburg, H.E. and H.E. Smokler (eds.), Studies in Subjective Probability, 1964, New York: Wiley.
  • –––, 1974, Theory of Probability, Volumes I and II, New York: Wiley, translation by A. Machi and A.F.M. Smith.
  • De Morgan, A., 1847, Formal Logic or The Calculus of Inference, London: Taylor & Walton, reprinted by London: Open Court, 1926.
  • Dempster, A.P., 1964, “On the Difficulties Inherent in Fisher’s Fiducial Argument”, Journal of the American Statistical Association, 59: 56–66.
  • –––, 1966, “New Methods for Reasoning Towards Posterior Distributions Based on Sample Data”, Annals of Mathematical Statistics, 37(2): 355–374.
  • –––, 1967, “Upper and Lower Probabilities Induced by a Multivalued Mapping”, The Annals of Mathematical Statistics, 38(2): 325–339.
  • –––, 1968, “A Generalization of Bayesian Inference”, Journal of the Royal Statistical Society, Series B, 30: 205–247.
  • Diaconis, P. and D. Freedman, 1980, “De Finetti’s Theorem for Markov Chains”, Annals of Probability, 8: 115–130.
  • Diaconis, P. and B. Skyrms, 2018, Ten Great Ideas about Chance, Princeton: Princeton University Press.
  • Eagle, A. (ed.), 2010, Philosophy of Probability: Contemporary Readings, London: Routledge.
  • Earman, J., 1992, Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory, Cambridge (MA): MIT Press.
  • Easwaran, K., 2013, “Expected Accuracy Supports Conditionalization—and Conglomerability and Reflection”, Philosophy of Science, 80(1): 119–142.
  • Edwards, A.W.F., 1972, Likelihood, Cambridge: Cambridge University Press.
  • Efron, B. and R. Tibshirani, 1993, An Introduction to the Bootstrap, Boca Raton (FL): Chapman & Hall/CRC.
  • Festa, R., 1993, Optimum Inductive Methods, Dordrecht: Kluwer.
  • –––, 1996, “Analogy and Exchangeability in Predictive Inferences”, Erkenntnis, 45: 89–112.
  • Fisher, R.A., 1925, Statistical Methods for Research Workers, Edinburgh: Oliver and Boyd.
  • –––, 1930, “Inverse Probability”, Proceedings of the Cambridge Philosophical Society, 26: 528–535.
  • –––, 1933, “The Concepts of Inverse Probability and Fiducial Probability Referring to Unknown Parameters”, Proceedings of the Royal Society, Series A, 139: 343–348.
  • –––, 1935a, “The Logic of Inductive Inference”, Journal of the Royal Statistical Society, 98: 39–82.
  • –––, 1935b, The Design of Experiments, Edinburgh: Oliver and Boyd.
  • –––, 1935c, “The Fiducial Argument in Statistical Inference”, Annals of Eugenics, 6: 317–324.
  • –––, 1955, “Statistical Methods and Scientific Induction”, Journal of the Royal Statistical Society, B 17: 69–78.
  • –––, 1956, Statistical Methods and Scientific Inference, New York: Hafner; 3rd edition, 1973.
  • Fitelson, B., 2007, “Likelihoodism, Bayesianism, and Relational Confirmation”, Synthese, 156(3): 473–489.
  • Fletcher, S.C. and C. Mayo-Wilson, 2023, “Evidence in Classical Statistics”, in M. Lasonen-Aarnio and C. Littlejohn (eds.), The Routledge Handbook of the Philosophy of Evidence, London: Taylor and Francis, 515–527.
  • Forster, M. and E. Sober, 1994, “How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions”, British Journal for the Philosophy of Science, 45: 1–35.
  • Fraassen, B. van, 1989, Laws and Symmetry, Oxford: Clarendon Press.
  • Gaifman, H. and M. Snir, 1982, “Probabilities over Rich Languages”, Journal of Symbolic Logic, 47: 495–548.
  • Galavotti, M.C., 2005, Philosophical Introduction to Probability, Stanford: CSLI Publications.
  • Genin, K., 2017, “The Topology of Statistical Verifiability”, Proceedings of the 17th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2017), Electronic Proceedings in Theoretical Computer Science. [Genin 2017 available online]
  • Gelman, A., J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin, 2013, Bayesian Data Analysis, revised edition, New York: Chapman & Hall/CRC.
  • Gelman, A. and C. Shalizi, 2013, “Philosophy and the Practice of Bayesian Statistics (with discussion)”, British Journal of Mathematical and Statistical Psychology, 66: 8–18.
  • Giere, R.N., 1976, “A Laplacean Formal Semantics for Single-Case Propensities”, Journal of Philosophical Logic, 5(3): 321–353.
  • Gillies, D., 1971, “A Falsifying Rule for Probability Statements”, British Journal for the Philosophy of Science, 22: 231–261.
  • –––, 2000, Philosophical Theories of Probability, London: Routledge.
  • Gold, E., 1967, “Language Identification in the Limit”, Information and Control, 10: 447–474.
  • Goldstein, M., 2006, “Subjective Bayesian Analysis: Principles and Practice”, Bayesian Analysis, 1(3): 403–420.
  • Good, I.J., 1983, Good Thinking: The Foundations of Probability and Its Applications, Minneapolis: University of Minnesota Press; reprinted London: Dover, 2009.
  • –––, 1988, “The Interface Between Statistics and Philosophy of Science”, Statistical Science, 3(4): 386–397.
  • Goodman, N., 1965, Fact, Fiction and Forecast, Indianapolis: Bobbs-Merrill.
  • Greaves, H. and D. Wallace, 2006, “Justifying Conditionalization: Conditionalization Maximizes Expected Epistemic Utility”, Mind, 115(459): 607–632.
  • Greco, D., 2011, “Significance Testing in Theory and Practice”, British Journal for the Philosophy of Science, 62: 607–637.
  • Grünwald, P.D., 2007, The Minimum Description Length Principle, Boston: MIT Press.
  • Hacking, I., 1965, The Logic of Statistical Inference, Cambridge: Cambridge University Press.
  • –––, 2006, The Emergence of Probability, Cambridge: Cambridge University Press, 2nd edition.
  • Haenni, R., J.-W. Romeijn, G. Wheeler, and J. Williamson, 2011, Probabilistic Logics and Probabilistic Networks, Berlin: Springer.
  • Hailperin, T., 1996, Sentential Probability Logic, Lehigh University Press.
  • Hájek, A., 2007, “The Reference Class Problem is Your Problem Too”, Synthese, 156: 563–585.
  • Hájek, A. and C. Hitchcock (eds.), 2013, Oxford Handbook of Probability and Philosophy, Oxford: Oxford University Press.
  • Halpern, J.Y., 2003, Reasoning about Uncertainty, Cambridge (MA): MIT Press.
  • Handfield, T., 2012, A Philosophical Guide to Chance: Physical Probability, Cambridge: Cambridge University Press.
  • Hannig, J., 2009, “On Generalized Fiducial Inference”, Statistica Sinica, 19: 491–544.
  • Harlow, L.L., S.A. Mulaik, and J.H. Steiger (eds.), 1997, What If There Were No Significance Tests?, Mahwah (NJ): Erlbaum.
  • Harman, G. and S. Kulkarni, 2007, Reliable Reasoning: Induction and Statistical Learning Theory, Cambridge (MA): MIT Press.
  • Hastie, T., R. Tibshirani, and J. Friedman, 2009, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer Series in Statistics), 2nd edition, New York: Springer.
  • Henderson, L., N.D. Goodman, J.B. Tenenbaum, and J.F. Woodward, 2010, “The Structure and Dynamics of Scientific Theories: A Hierarchical Bayesian Perspective”, Philosophy of Science, 77(2): 172–200.
  • Herrmann, D.A., 2020, “PAC Learning and Occam’s Razor: Probably Approximately Incorrect”, Philosophy of Science, 87(4): 685–703.
  • Hjort, N., C. Holmes, P. Mueller, and S. Walker (eds.), 2010, Bayesian Nonparametrics (Cambridge Series in Statistical and Probabilistic Mathematics, nr. 28), Cambridge: Cambridge University Press.
  • Howson, C., 2000, Hume’s Problem: Induction and the Justification of Belief, Oxford: Oxford University Press.
  • –––, 2003, “Probability and Logic”, Journal of Applied Logic, 1(3–4): 151–165.
  • –––, 2011, “Bayesianism as a Pure Logic of Inference”, in P. Bandyopadhyay and M. Forster (eds.), Philosophy of Statistics (Handbook of the Philosophy of Science), Oxford: North Holland, 441–472.
  • Howson, C. and P. Urbach, 2006, Scientific Reasoning: The Bayesian Approach, La Salle: Open Court, 3rd edition.
  • Hintikka, J., 1970, “Unknown Probabilities, Bayesianism, and de Finetti’s Representation Theorem”, in Proceedings of the Biennial Meeting of the Philosophy of Science Association 1970, Boston: Springer, 325–341.
  • Hintikka, J. and I. Niiniluoto, 1980, “An Axiomatic Foundation for the Logic of Inductive Generalization”, in R.C. Jeffrey (ed.), Studies in Inductive Logic and Probability, Volume II, Berkeley: University of California Press, 157–181.
  • Hintikka, J. and P. Suppes (eds.), 1966, Aspects of Inductive Logic, Amsterdam: North-Holland.
  • Hitchcock, C. and A. Hájek (eds.), 2016, The Oxford Handbook of Probability and Philosophy, Oxford: Oxford University Press.
  • Hume, D., 1739, A Treatise of Human Nature, available online.
  • Huttegger, S., 2017, The Probabilistic Foundations of Rational Learning, Cambridge: Cambridge University Press.
  • Huygens, C., 1657, “De Ratiociniis in Aleæ Ludo”, in F. van Schooten (ed.), Exercitationum Mathematicarum libri quinque, Leiden: Johannis Elsevirii, 517–534.
  • Ioannidis, J.P.A., 2005, “Why Most Published Research Findings Are False”, PLoS Medicine, 2(8): e124. doi:10.1371/journal.pmed.0020124
  • James, G., D. Witten, T. Hastie, and R. Tibshirani, 2013, An Introduction to Statistical Learning, New York: Springer.
  • Jaynes, E.T., 1973, “The Well-Posed Problem”, Foundations of Physics, 3: 477–493.
  • –––, 2003, Probability Theory: The Logic of Science, Cambridge: Cambridge University Press, first 3 chapters available online.
  • Jeffrey, R., 1992, Probability and the Art of Judgment, Cambridge: Cambridge University Press.
  • Jeffreys, H., 1961, Theory of Probability, Oxford: Clarendon Press, 3rd edition.
  • Jolliffe, I.T., 2002, Principal Component Analysis, New York: Springer, 2nd edition.
  • Joyce, J., 1998, “A Nonpragmatic Vindication of Probabilism”, Philosophy of Science, 65: 575–603.
  • Kadane, J.B., 2011, Principles of Uncertainty, London: Chapman and Hall.
  • Kadane, J.B., M.J. Schervish, and T. Seidenfeld, 1996a, “When Several Bayesians Agree That There Will Be No Reasoning to a Foregone Conclusion”, Philosophy of Science, 63: S281–S289.
  • –––, 1996b, “Reasoning to a Foregone Conclusion”, Journal of the American Statistical Association, 91(435): 1228–1235.
  • Kass, R. and A. Raftery, 1995, “Bayes Factors”, Journal of the American Statistical Association, 90: 773–790.
  • Kelly, K.T., 1996, The Logic of Reliable Inquiry, Oxford: Oxford University Press.
  • –––, 2007, “A New Solution to the Puzzle of Simplicity”, Philosophy of Science, 74(5): 561–573.
  • Kelly, K., O. Schulte, and C. Juhl, 1997, “Learning Theory and the Philosophy of Science”, Philosophy of Science, 64: 245–267.
  • Keynes, J.M., 1921, A Treatise on Probability, London: Macmillan.
  • Kieseppä, I.A., 1997, “Akaike Information Criterion, Curve-Fitting, and the Philosophical Problem of Simplicity”, British Journal for the Philosophy of Science, 48(1): 21–48.
  • –––, 2001, “Statistical Model Selection Criteria and the Philosophical Problem of Underdetermination”, British Journal for the Philosophy of Science, 52(4): 761–794.
  • Kingman, J.F.C., 1975, “Random Discrete Distributions”, Journal of the Royal Statistical Society, 37: 1–22.
  • –––, 1978, “Uses of Exchangeability”, Annals of Probability, 6(2): 183–197.
  • Kolmogorov, A.N., 1933, Grundbegriffe der Wahrscheinlichkeitsrechnung, Berlin: Julius Springer.
  • Krantz, D.H., R.D. Luce, A. Tversky, and P. Suppes, 1971, Foundations of Measurement, Volumes I and II, Mineola: Dover Publications.
  • Kuipers, T.A.F., 1978, Studies in Inductive Probability and Rational Expectation, Dordrecht: Reidel.
  • –––, 1986, “Some Estimates of the Optimum Inductive Method”, Erkenntnis, 24: 37–46.
  • Kyburg, H.E. Jr., 1961, Probability and the Logic of Rational Belief, Middletown (CT): Wesleyan University Press.
  • Kyburg, H.E. Jr. and C.M. Teng, 2001, Uncertain Inference, Cambridge: Cambridge University Press.
  • Lambalgen, M. van, 1987, Random Sequences, Ph.D. dissertation, Department of Mathematics and Computer Science, University of Amsterdam. [van Lambalgen 1987 available online]
  • Leitgeb, H. and R. Pettigrew, 2010a, “An Objective Justification of Bayesianism I: Measuring Inaccuracy”, Philosophy of Science, 77(2): 201–235.
  • –––, 2010b, “An Objective Justification of Bayesianism II: The Consequences of Minimizing Inaccuracy”, Philosophy of Science, 77(2): 236–272.
  • Levi, I., 1980, The Enterprise of Knowledge: An Essay on Knowledge, Credal Probability, and Chance, Cambridge (MA): MIT Press.
  • Lindley, D.V., 1957, “A Statistical Paradox”, Biometrika, 44: 187–192.
  • –––, 1965, Introduction to Probability and Statistics from a Bayesian Viewpoint, Volumes I and II, Cambridge: Cambridge University Press.
  • –––, 2000, “The Philosophy of Statistics”, Journal of the Royal Statistical Society, D (The Statistician), 49(3): 293–337.
  • Mackay, D.J.C., 2003, Information Theory, Inference, and Learning Algorithms, Cambridge: Cambridge University Press.
  • Maher, P., 1993, Betting on Theories (Cambridge Studies in Probability, Induction and Decision Theory), Cambridge: Cambridge University Press.
  • Mayo, D.G., 1996, Error and the Growth of Experimental Knowledge, Chicago: The University of Chicago Press.
  • –––, 2010, “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle”, in D. Mayo and A. Spanos (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, Cambridge: Cambridge University Press, 305–314.
  • –––, 2014, “On the Birnbaum Argument for the Strong Likelihood Principle”, Statistical Science, 29(2): 227–239.
  • –––, 2018, Statistical Inference as Severe Testing, Cambridge: Cambridge University Press.
  • Mayo, D.G. and A. Spanos, 2006, “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction”, The British Journal for the Philosophy of Science, 57: 323–357.
  • –––, 2011, “Error Statistics”, in P.S. Bandyopadhyay and M.R. Forster (eds.), Philosophy of Statistics (Handbook of the Philosophy of Science, Volume 7), Elsevier.
  • Mayo, D.G. and D.R. Cox, 2006, “Frequentist Statistics as a Theory of Inductive Inference”, in IMS Lecture Notes Monograph Series (Volume 49: 2nd Lehmann Symposium on Optimality), Institute of Mathematical Statistics, 77–97. doi:10.1214/074921706000000400
  • Mellor, D.H., 2005, The Matter of Chance, Cambridge: Cambridge University Press.
  • –––, 2005, Probability: A Philosophical Introduction, London: Routledge.
  • Mises, R. von, 1981, Probability, Statistics and Truth, 2nd revised English edition, New York: Dover.
  • Mood, A.M., F.A. Graybill, and D.C. Boes, 1974, Introduction to the Theory of Statistics, Boston: McGraw-Hill.
  • Morey, R., J.W. Romeijn, and J. Rouder, 2013, “The Humble Bayesian”, British Journal of Mathematical and Statistical Psychology, 66(1): 68–75.
  • Morey, R.D., R. Hoekstra, J.N. Rouder, M.D. Lee, and E.J. Wagenmakers, 2016, “The Fallacy of Placing Confidence in Confidence Intervals”, Psychonomic Bulletin and Review, 23(1): 103–123.
  • Myung, J., V. Balasubramanian, and M.A. Pitt, 2000, “Counting Probability Distributions: Differential Geometry and Model Selection”, Proceedings of the National Academy of Sciences, 97(21): 11170–11175.
  • Nagel, E., 1939, Principles of the Theory of Probability, Chicago: University of Chicago Press.
  • Neyman, J., 1957, “Inductive Behavior as a Basic Concept of Philosophy of Science”, Revue de l’Institut International de Statistique, 25: 7–22.
  • –––, 1971, “Foundations of Behavioristic Statistics”, in V. Godambe and D. Sprott (eds.), Foundations of Statistical Inference, Toronto: Holt, Rinehart and Winston of Canada, 1–19.
  • Neyman, J. and E.S. Pearson, 1928, “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference”, Biometrika, A20: 175–240 and 264–294.
  • –––, 1933, “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, Philosophical Transactions of the Royal Society, A 231: 289–337.
  • –––, 1967, Joint Statistical Papers, Cambridge: Cambridge University Press.
  • Nix, C.J. and J.B. Paris, 2006, “A Continuum of Inductive Methods Arising from a Generalised Principle of Instantial Relevance”, Journal of Philosophical Logic, 35: 83–115.
  • Osherson, D.N., M. Stob, and S. Weinstein, 1988, “Mechanical Learners Pay a Price for Bayesianism”, The Journal of Symbolic Logic, 53(4): 1245–1251.
  • Orbanz, P. and Y.W. Teh, 2010, “Bayesian Nonparametric Models”, in Encyclopedia of Machine Learning, New York: Springer.
  • Paris, J.B., 1994, The Uncertain Reasoner’s Companion, Cambridge: Cambridge University Press.
  • Paris, J.B. and A. Vencovska, 1989, “On the Applicability of Maximum Entropy to Inexact Reasoning”, International Journal of Approximate Reasoning, 4(3): 183–224.
  • Paris, J. and P. Waterhouse, 2009, “Atom Exchangeability and Instantial Relevance”, Journal of Philosophical Logic, 38(3): 313–332.
  • Peirce, C.S., 1910, “Notes on the Doctrine of Chances”, in C. Hartshorne and P. Weiss (eds.), Collected Papers of Charles Sanders Peirce, Volume 2, Cambridge (MA): Harvard University Press, 405–414; reprinted 1931.
  • Plato, J. von, 1994, Creating Modern Probability, Cambridge: Cambridge University Press.
  • Popper, K.R., 1934/1959, The Logic of Scientific Discovery, New York: Basic Books.
  • –––, 1959, “The Propensity Interpretation of Probability”, British Journal for the Philosophy of Science, 10: 25–42.
  • Predd, J.B., R. Seiringer, E.H. Lieb, D.N. Osherson, H.V. Poor, and S.R. Kulkarni, 2009, “Probabilistic Coherence and Proper Scoring Rules”, IEEE Transactions on Information Theory, 55(10): 4786–4792.
  • Press, S.J., 2002, Bayesian Statistics: Principles, Models, and Applications (Wiley Series in Probability and Statistics), New York: Wiley.
  • Putnam, H., 1963, “Degree of Confirmation and Inductive Logic”, in P.A. Schilpp (ed.), The Philosophy of Rudolf Carnap, La Salle (IL): Open Court.
  • Raftery, A.E., 1995, “Bayesian Model Selection in Social Research”, Sociological Methodology, 25: 111–163.
  • Ramsey, F.P., 1926, “Truth and Probability”, in R.B. Braithwaite (ed.), The Foundations of Mathematics and other Logical Essays, Chapter VII, 156–198, London: Kegan Paul, 1931.
  • Reichenbach, H., 1938, Experience and Prediction: An Analysis of the Foundations and the Structure of Knowledge, Chicago: University of Chicago Press.
  • –––, 1949, The Theory of Probability, Berkeley: University of California Press.
  • –––, 1956, The Direction of Time, Berkeley: University of California Press.
  • Renyi, A., 1970, Probability Theory, Amsterdam: North Holland.
  • Robbins, H., 1952, “Some Aspects of the Sequential Design of Experiments”, Bulletin of the American Mathematical Society, 58: 527–535.
  • Roberts, H.V., 1967, “Informative Stopping Rules and Inferences about Population Size”, Journal of the American Statistical Association, 62(319): 763–775.
  • Romeijn, J.W., 2004, “Hypotheses and Inductive Predictions”, Synthese, 141(3): 333–364.
  • –––, 2005, Bayesian Inductive Logic, Ph.D. dissertation, University of Groningen.
  • –––, 2006, “Analogical Predictions for Explicit Similarity”, Erkenntnis, 64: 253–280.
  • –––, 2011, “Statistics as Inductive Logic”, in P. Bandyopadhyay and M. Forster (eds.), Handbook of the Philosophy of Science (Volume 7: Philosophy of Statistics), 751–774.
  • –––, 2017, “Implicit Complexity”, Philosophy of Science, 84(5): 797–809.
  • Romeijn, J.W. and R. van de Schoot, 2008, “A Philosophical Analysis of Bayesian Model Selection”, in H. Hoijtink, I. Klugkist, and P. Boelen (eds.), Null, Alternative and Informative Hypotheses, 329–357.
  • Romeijn, J.W., R. van de Schoot, and H. Hoijtink, 2012, “One Size Does Not Fit All: Derivation of a Prior-Adapted BIC”, in D. Dieks, W. Gonzales, S. Hartmann, F. Stadler, T. Uebel, and M. Weber (eds.), Probabilities, Laws, and Structures, Berlin: Springer.
  • Rosenkrantz, R.D., 1977, Inference, Method and Decision: Towards a Bayesian Philosophy of Science, Dordrecht: Reidel.
  • –––, 1981, Foundations and Applications of Inductive Probability, Ridgeview Press.
  • Royall, R., 1997, Scientific Evidence: A Likelihood Paradigm, London: Chapman and Hall.
  • Savage, L.J., 1962, The Foundations of Statistical Inference, London: Methuen.
  • Schervish, M.J., T. Seidenfeld, and J.B. Kadane, 2009, “Proper Scoring Rules, Dominated Forecasts, and Coherence”, Decision Analysis, 6(4): 202–221.
  • Schurz, G., 2019, Hume’s Problem Solved, Cambridge (MA): MIT Press.
  • Schwarz, G., 1978, “Estimating the Dimension of a Model”, Annals of Statistics, 6: 461–464.
  • Seidenfeld, T., 1979, Philosophical Problems of Statistical Inference: Learning from R.A. Fisher, Dordrecht: Reidel.
  • –––, 1986, “Entropy and Uncertainty”, Philosophy of Science, 53(4): 467–491.
  • –––, 1992, “R.A. Fisher’s Fiducial Argument and Bayes Theorem”, Statistical Science, 7(3): 358–368.
  • Shafer, G., 1976, A Mathematical Theory of Evidence, Princeton: Princeton University Press.
  • –––, 1982, “On Lindley’s Paradox (with discussion)”, Journal of the American Statistical Association, 77(378): 325–351.
  • Shore, J. and R. Johnson, 1980, “Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy”, IEEE Transactions on Information Theory, 26(1): 26–37.
  • Skyrms, B., 1991, “Carnapian Inductive Logic for Markov Chains”, Erkenntnis, 35: 439–460.
  • –––, 1993, “Analogy by Similarity in Hypercarnapian Inductive Logic”, in G.J. Massey, J. Earman, A.I. Janis, and N. Rescher (eds.), Philosophical Problems of the Internal and External Worlds: Essays Concerning the Philosophy of Adolf Gruenbaum, Pittsburgh: Pittsburgh University Press, 273–282.
  • –––, 1996, “Carnapian Inductive Logic and Bayesian Statistics”, in T.S. Ferguson, L.S. Shapley, and J.B. MacQueen (eds.), Statistics, Probability, and Game Theory: Papers in Honour of David Blackwell, Hayward: IMS Lecture Notes, 321–336.
  • –––, 1999, Choice and Chance: An Introduction to Inductive Logic, Wadsworth, 4th edition.
  • Sober, E., 2004, “Likelihood, Model Selection, and the Duhem-Quine Problem”, Journal of Philosophy, 101(5): 221–241.
  • Spanos, A., 2010, “Is Frequentist Testing Vulnerable to the Base-Rate Fallacy?”, Philosophy of Science, 77: 565–583.
  • –––, 2013a, “Who Should Be Afraid of the Jeffreys-Lindley Paradox?”, Philosophy of Science, 80: 73–93.
  • –––, 2013b, “A Frequentist Interpretation of Probability for Model-Based Inductive Inference”, Synthese, 190: 1555–1585.
  • Spiegelhalter, D.J., N.G. Best, B.P. Carlin, and A. van der Linde, 2002, “Bayesian Measures of Model Complexity and Fit”, Journal of the Royal Statistical Society, B 64: 583–639.
  • Spielman, S., 1974, “The Logic of Significance Testing”, Philosophy of Science, 41: 211–225.
  • –––, 1978, “Statistical Dogma and the Logic of Significance Testing”, Philosophy of Science, 45: 120–135.
  • Sprenger, J., 2013, “The Role of Bayesian Philosophy within Bayesian Model Selection”, European Journal for Philosophy of Science, 3(1): 101–114.
  • –––, 2013, “Testing a Precise Null Hypothesis: The Case of Lindley’s Paradox”, Philosophy of Science, 80(5): 733–744.
  • –––, 2016, “Bayesianism vs. Frequentism in Statistical Inference”, in A. Hájek and C. Hitchcock (eds.), Handbook of the Philosophy of Probability, Oxford: Oxford University Press, 382–405.
  • Spirtes, P., C. Glymour, and R. Scheines, 2001, Causation, Prediction, and Search, Boston: MIT Press, 2nd edition.
  • Solomonoff, R.J., 1964, “A Formal Theory of Inductive Inference”, Parts I and II, Information and Control, 7: 1–22 and 224–254.
  • Stegenga, J., 2011, “Is Meta-Analysis the Platinum Standard of Evidence?”, Studies in History and Philosophy of Biological and Biomedical Sciences, 42(4): 497–507.
  • Steele, K., 2013, “Persistent Experimenters, Stopping Rules, and Statistical Inference”, Erkenntnis, 78(4): 937–961.
  • Steele, K. and C. Werndl, 2013, “Climate Models, Calibration, and Confirmation”, British Journal for the Philosophy of Science, 64: 609–635.
  • Sterkenburg, T., 2018, Universal Prediction: A Philosophical Investigation, Ph.D. dissertation, Rijksuniversiteit Groningen.
  • –––, 2020, “The Meta-Inductive Justification of Induction”, Episteme, 17(4): 519–541.
  • ––– and P.D. Grünwald, 2021, “The No-Free-Lunch Theorems of Supervised Learning”, Synthese, 199: 9979–10015.
  • Suppes, P., 2001, Representation and Invariance of Scientific Structures, Chicago: University of Chicago Press.
  • Uffink, J., 1996, “The Constraint Rule of the Maximum Entropy Principle”, Studies in History and Philosophy of Modern Physics, 27: 47–79.
  • Vapnik, V.N., 2000, The Nature of Statistical Learning Theory (Statistics for Engineering and Information Science), 2nd edition, New York: Springer.
  • Vapnik, V.N. and S. Kotz, 2006, Estimation of Dependences Based on Empirical Data, New York: Springer.
  • Venn, J., 1888, The Logic of Chance, London: MacMillan, 3rd edition.
  • Wagenmakers, E.J., 2007, “A Practical Solution to the Pervasive Problems of p-Values”, Psychonomic Bulletin and Review, 14(5): 779–804.
  • Wagenmakers, E.J. and L.J. Waldorp (eds.), 2006, Journal of Mathematical Psychology, 50(2), special issue on Model Selection, 99–214.
  • Wald, A., 1939, “Contributions to the Theory of Statistical Estimation and Testing Hypotheses”, Annals of Mathematical Statistics, 10(4): 299–326.
  • –––, 1950, Statistical Decision Functions, New York: John Wiley and Sons.
  • Walley, P., 1991, Statistical Reasoning with Imprecise Probabilities, New York: Chapman & Hall.
  • Wasserman, L., 2004, All of Statistics: A Concise Course in Statistical Inference, New York: Springer.
  • Williams, P.M., 1980, “Bayesian Conditionalisation and the Principle of Minimum Information”, British Journal for the Philosophy of Science, 31: 131–144.
  • Williamson, J., 2010, In Defence of Objective Bayesianism, Oxford: Oxford University Press.
  • Zabell, S.L., 1982, “W.E. Johnson’s ‘Sufficientness’ Postulate”, Annals of Statistics, 10(4): 1090–1099.
  • –––, 1992, “R.A. Fisher and the Fiducial Argument”, Statistical Science, 7(3): 369–387.
  • Ziliak, S.T. and D.N. McCloskey, 2008, The Cult of Statistical Significance, Ann Arbor: University of Michigan Press.

Other Internet Resources

[Please contact the author with suggestions.]

Copyright © 2025 by
Jan-Willem Romeijn <j.w.romeijn@rug.nl>


The Stanford Encyclopedia of Philosophy is copyright © 2025 by The Metaphysics Research Lab, Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054

