1. Although enumerative inductive arguments may seem similar to what classical statisticians call estimation, the two are not really the same thing. As classical statisticians are quick to point out, estimation does not use the sample to inductively support a conclusion about the whole population. Estimation is not supposed to be a kind of inductive inference. Rather, estimation is a decision strategy. The sample frequency will be within two standard deviations of the population frequency in about 95% of all samples. So, if one adopts the strategy of accepting as true the claim that the population frequency is within two standard deviations of the sample frequency, and if one uses this strategy repeatedly for various samples, one should be right about 95% of the time. I discuss enumerative induction in much more detail in the supplement on Enumerative Inductions: Bayesian Estimation and Convergence, which treats Bayesian convergence and the satisfaction of the CoA for the special case of enumerative inductions.
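The 95% success rate of this decision strategy is easy to check by simulation. Here is a minimal sketch (the population frequency of 0.3, sample size of 400, and number of repetitions are illustrative choices of mine, not taken from the text):

```python
import random

def strategy_success_rate(pop_freq=0.3, n=400, repetitions=5000, seed=1):
    """Repeatedly draw samples; each time, accept the claim that the
    population frequency lies within two standard deviations of the
    sample frequency. Return the fraction of repetitions in which
    that accepted claim is actually true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        k = sum(rng.random() < pop_freq for _ in range(n))
        f = k / n
        sd = (f * (1 - f) / n) ** 0.5  # estimated SD of the sample frequency
        if abs(f - pop_freq) <= 2 * sd:
            hits += 1
    return hits / repetitions

print(strategy_success_rate())  # close to 0.95, as the note says
```

The strategy's advertised success rate emerges without any inductive inference from the sample to the population: it is simply a long-run property of the acceptance rule.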
2. Here are the more usual axioms for conditional probability functions. These axioms are provably equivalent to the axioms presented in the main text.
A support function is a function \(P_{\alpha}\) from pairs of sentences of L to real numbers between 0 and 1 (inclusive) that satisfies the following axioms:
- (1) \(P_{\alpha}[D \pmid E] \ne 1\) for at least one pair of sentences D and E.
For all sentences \(A\), \(B\), and \(C\),
- (2) If \(B \vDash A\), then \(P_{\alpha}[A \pmid B] = 1\);
- (3) If \(B \vDash C\) and \(C \vDash B\), then \(P_{\alpha}[A \pmid B] = P_{\alpha}[A \pmid C]\);
- (4) If \(C \vDash{\nsim}(B\cdot A)\), then either \[P_{\alpha}[(A \vee B) \pmid C] = P_{\alpha}[A \pmid C] + P_{\alpha}[B \pmid C]\] or else \(P_{\alpha}[D \pmid C] = 1\) for every D;
- (5) \(P_{\alpha}[(A\cdot B) \pmid C] = P_{\alpha}[A \pmid (B\cdot C)] \times P_{\alpha}[B \pmid C]\).
The results stated in the main text derive from the axioms presented there as follows:
Proof: First notice that (by axiom 2) all logical entailments must have the same support value: for, when \(B \vDash A\) and \(D \vDash C\) we have \(P_{\alpha}[A \pmid B] \ge P_{\alpha}[C \pmid D] \ge P_{\alpha}[A \pmid B]\); so \(P_{\alpha}[C \pmid D] = P_{\alpha}[A \pmid B]\). Thus, all logical entailments \(B \vDash A\) must have the same real number support value \(r\): \(P_{\alpha}[A \pmid B] = r\) whenever \(B \vDash A\). To see that r must equal either 1 or 0, use axiom 5 as follows (employing the various logical entailments involved):
\[r = P_{\alpha}[A\cdot A \pmid A] = P_{\alpha}[A \pmid A\cdot A] \times P_{\alpha}[A \pmid A] = r \times r;\]so \(r = r \times r\); so \(r = 1\) or \(r = 0\). To see that \(r \ne 0\), let us suppose \(r = 0\) and derive a contradiction. Axiom 1 together with axiom 2 requires that for some A and B, \(P_{\alpha}[A \pmid B] \lt P_{\alpha}[A \pmid A] = r = 0\). Then, for this A and B, since \(B \vDash(A \vee{\nsim}A)\) and \(B \vDash{\nsim}(A \cdot{\nsim}A)\), from axiom 4 we have
\[0 = r = P_{\alpha}[(A \vee{\nsim}A) \pmid B] = P_{\alpha}[A \pmid B] + P_{\alpha}[{\nsim}A \pmid B]\](or else \(P_{\alpha}[C \pmid B] = P_{\alpha}[B \pmid B] = 0\) for every sentence C, which contradicts our supposition that \(P_{\alpha}[A \pmid B] \lt 0\)). But then we must have \(P_{\alpha}[{\nsim}A \pmid B] \gt 0 = r = P_{\alpha}[A \pmid A]\), which contradicts axiom 2.
If \(C \vDash{\nsim}(B\cdot A)\), then either
\[P_{\alpha}[(A \vee B) \pmid C] = P_{\alpha}[A \pmid C] + P_{\alpha}[B \pmid C]\]or else \(P_{\alpha}[E \pmid C] = 1\) for every sentence E.
Proof: Since \(E \vDash E\), \(P_{\alpha}[E \pmid E] = 1\). Thus, axiom 4 becomes: If \(C \vDash{\nsim}(B\cdot A)\), then either
\[P_{\alpha}[(A \vee B) \pmid C] = P_{\alpha}[A \pmid C] + P_{\alpha}[B \pmid C]\]or else \(P_{\alpha}[E \pmid C] = 1\) for every sentence E.
Proof: \(B \vDash(A \vee{\nsim}A)\) and \(B\vDash{\nsim}(A \cdot{\nsim}A)\), so
\[1 = P_{\alpha}[(A \vee{\nsim}A) \pmid B] = P_{\alpha}[A \pmid B] + P_{\alpha}[{\nsim}A \pmid B]\]or else \(P_{\alpha}[C \pmid B] = 1\) for every sentence C. Thus, \(P_{\alpha}[{\nsim}A \pmid B] = 1 - P_{\alpha}[A \pmid B]\) or else \(P_{\alpha}[C \pmid B] = 1\) for every sentence C.
Proof: Result 1 (above) together with axiom 2 implies that \(1 \ge P_{\alpha}[A \pmid B]\), for all \(A, B\). Suppose that for some A and B, \(P_{\alpha}[A \pmid B] \lt 0\); then
\[P_{\alpha}[{\nsim}A \pmid B] = 1 - P_{\alpha}[A \pmid B] \gt 1,\]contradiction! So we must have \(P_{\alpha}[A \pmid B] \ge 0\). Thus, for all sentences A and B, \(1 \ge P_{\alpha}[A \pmid B] \ge 0\).
Proof: Suppose \(B \vDash A\). Then \(\vDash{\nsim}(B\cdot{\nsim}A)\), so
\[\begin{align}1 & \ge P_{\alpha}[(B \vee{\nsim}A) \pmid C] \\& = P_{\alpha}[B \pmid C] + P_{\alpha}[{\nsim}A \pmid C] \\& = P_{\alpha}[B \pmid C] + 1 - P_{\alpha}[A \pmid C].\end{align}\]Thus, \(P_{\alpha}[A \pmid C] \ge P_{\alpha}[B \pmid C]\).
Proof: Suppose \(B \vDash A\) and \(A \vDash B\). Then, applying the previous result in both directions, we have \(P_{\alpha}[A \pmid C] = P_{\alpha}[B \pmid C]\).
Proof: Suppose \(C \vDash B\). Then, \((A\cdot C)\vDash B\), so \(P_{\alpha}[B \pmid (A\cdot C)] = 1\), by result 1above. Then, by axiom 5,
\[\begin{align}P_{\alpha}[(B\cdot A) \pmid C] & = P_{\alpha}[B \pmid (A\cdot C)] \times P_{\alpha}[A \pmid C] \\&= P_{\alpha}[A \pmid C].\end{align}\]Thus, since \((A\cdot B) \vDash(B\cdot A)\) and \((B\cdot A)\vDash(A\cdot B)\), result 6 above yields
\[\begin{align}P_{\alpha}[(A\cdot B) \pmid C] & = P_{\alpha}[(B\cdot A) \pmid C] \\&= P_{\alpha}[A \pmid C].\end{align}\]

Proof: Suppose \(C \vDash B\) and \(B \vDash C\). Then, from previous results together with axioms 5 and 3 we have
\[\begin{align}P_{\alpha}[A \pmid B] & = P_{\alpha}[(A\cdot C) \pmid B] \\& = P_{\alpha}[A \pmid (C\cdot B)] \times P_{\alpha}[C \pmid B] \\& = P_{\alpha}[A \pmid (C\cdot B)]\\& = P_{\alpha}[A \pmid (B\cdot C)] \\& = P_{\alpha}[A \pmid (B\cdot C)] \times P_{\alpha}[B \pmid C] \\& = P_{\alpha}[(A\cdot B) \pmid C] \\& = P_{\alpha}[A \pmid C]\end{align}\]

Proof: Axiom 5 together with result 6 yields
\[\begin{align}P_{\alpha}[A \pmid (B\cdot C)] \times P_{\alpha}[B \pmid C] & = P_{\alpha}[(A\cdot B) \pmid C] \\& = P_{\alpha}[(B\cdot A) \pmid C] \\& = P_{\alpha}[B \pmid (A\cdot C)] \times P_{\alpha}[A \pmid C].\end{align}\]Thus, if \(P_{\alpha}[B \pmid C] \gt 0\), then
\[P_{\alpha}[A \pmid (B\cdot C)] = P_{\alpha}[B \pmid (A\cdot C)] \times \frac{P_{\alpha}[A \pmid C]}{P_{\alpha}[B \pmid C]}.\]

Proof: If \(P_{\alpha}[D \pmid C] = 1\) for every D, then each term in the result is 1, so the result holds. So, let’s suppose that \(P_{\alpha}[D \pmid C] \ne 1\) for some D. Notice that
\[(A \vee B) \vDash(A \vee({\nsim}A \cdot B))\]and
\[(A \vee({\nsim}A \cdot B)) \vDash(A \vee B),\]and also
\[C \vDash{\nsim}(A \cdot({\nsim}A \cdot B)).\]We’ll also be using the fact that
\[B \vDash((A \cdot B)\vee({\nsim}A \cdot B))\]and
\[((A \cdot B)\vee({\nsim}A \cdot B)) \vDash B,\]and also
\[C \vDash{\nsim}((A \cdot B)\cdot({\nsim}A \cdot B)).\]Then
\[\begin{align}P_{\alpha}[(A \vee B) \pmid C] & = P_{\alpha}[(A \vee({\nsim}A\cdot B)) \pmid C]\\& = P_{\alpha}[A \pmid C] + P_{\alpha}[({\nsim}A\cdot B) \pmid C]\\& = P_{\alpha}[A \pmid C] + P_{\alpha}[({\nsim}A\cdot B) \pmid C] + P_{\alpha}[B \pmid C] \\&\qquad - P_{\alpha}[B \pmid C]\\& = P_{\alpha}[A \pmid C] + P_{\alpha}[({\nsim}A\cdot B) \pmid C] + P_{\alpha}[B \pmid C] \\&\qquad - P_{\alpha}[(A\cdot B)\vee({\nsim}A\cdot B) \pmid C]\\& = P_{\alpha}[A \pmid C] + P_{\alpha}[({\nsim}A\cdot B) \pmid C] + P_{\alpha}[B \pmid C] \\&\qquad - P_{\alpha}[(A\cdot B) \pmid C] - P_{\alpha}[({\nsim}A\cdot B) \pmid C]\\& = P_{\alpha}[A \pmid C] + P_{\alpha}[B \pmid C] - P_{\alpha}[(A\cdot B) \pmid C].\end{align}\]

Proof: For each distinct i and j, let \(C \vDash{\nsim}(B_{i}\cdot B_{j})\); and suppose that \(P_{\alpha}[D \pmid C] \lt 1\) for at least one sentence D. First notice that we have, for each i greater than 1 and less than n,
\[C \vDash({\nsim}(B_1\cdot B_{i+1})\cdot \ldots \cdot{\nsim}(B_{i}\cdot B_{i+1}));\]so
\[C \vDash{\nsim}(((B_1\vee B_2)\vee \ldots \vee B_i)\cdot B_{i+1}).\]Then, for any finite list of the firstn of the \(B_i\) (foreach value ofn),
\[\begin{align}P_{\alpha}&[(((B_1\vee B_2)\vee \ldots \vee B_{n-1})\vee B_n) \pmid C]\\&\qquad = P_{\alpha}[((B_1\vee B_2)\vee \ldots \vee B_{n-1}) \pmid C] + P_{\alpha}[B_n \pmid C] \\&\qquad = \ldots \\&\qquad = \sum^{n}_{i=1} P_{\alpha}[B_i \pmid C].\\\end{align}\]

3. This result is not the rule commonly known as countable additivity. Countable additivity requires a language in which infinite disjunctions are defined.
The present result follows directly from the previous result (without appealing to countable additivity), since, by definition
\[\sum^{\infty}_{i=1} P_{\alpha}[B_i \pmid C] = \lim_n\sum^{n}_{i=1} P_{\alpha}[B_i \pmid C].\] So, under the conditions stated in the main text,\[\lim_n P_{\alpha}[((B_1\vee B_2)\vee \ldots \vee B_n) \pmid C] = \sum^{\infty}_{i=1} P_{\alpha}[B_i \pmid C].\]Given a language in which infinite disjunction is defined, countable additivity would then result from the following rule (or axiom):
\[P_{\alpha}[((B_1\vee B_2)\vee \ldots) \pmid C] = \lim_n P_{\alpha}[((B_1\vee B_2)\vee \ldots \vee B_n) \pmid C].\]

4. Here are the usual axioms when unconditional probability is taken as basic:
\(P_{\alpha}\) is a function from statements to real numbers between 0 and 1 that satisfies the following rules:

- (1) If \(\vDash A\), then \(P_{\alpha}[A] = 1\);
- (2) If \(\vDash{\nsim}(A\cdot B)\), then \(P_{\alpha}[A \vee B] = P_{\alpha}[A] + P_{\alpha}[B]\).
Definition: if \(P_{\alpha}[B] \gt 0\), then
\[P_{\alpha}[A \pmid B] = \frac{P_{\alpha}[(A\cdot B)]}{P_{\alpha}[B]}.\]

5. Bayesians often refer to the probability of an evidence statement on a hypothesis, \(P[e \pmid h\cdot b\cdot c]\), as the likelihood of the hypothesis. This can be a somewhat confusing convention, since it is clearly the evidence that is made likely to whatever degree by the hypothesis. So, I will disregard the usual convention here. Also, presentations of probabilistic inductive logic often suppress c and b, and simply write ‘\(P[e \pmid h]\)’. But c and b are important parts of the logic of the likelihoods. So I will continue to make them explicit.
6. These attempts have not been wholly satisfactory thus far, but research continues. For an illuminating discussion of the logic of direct inference and the difficulties involved in providing a formal account, see the series of papers: Levi 1977; Kyburg 1978; Levi 1978. Levi 1980 develops a very sophisticated approach.
Kyburg has developed a logic of statistical inference based solely on logical direct inference probabilities (Kyburg 1974). Kyburg’s logical probabilities do not satisfy the usual axioms of probability theory. The series of papers cited above compares Kyburg’s approach to a kind of Bayesian inductive logic championed by Levi (e.g., in Levi 1967).
7. This idea should not be confused with logical positivism or logical empiricism. A version of logical positivism applied to likelihoods would hold that if two theories assign the same likelihood values to all possible evidence claims, then they are essentially the same theory, though they may be couched in different words. In short: same likelihoods implies same theory. The view suggested here, however, is not positivism, but its inverse, which should be much less controversial: different likelihoods implies different theories. That is, given that all of the relevant background and auxiliaries are made explicit (represented in ‘b’), if two scientists disagree significantly about the likelihoods of important evidence claims on a given hypothesis, they must understand the empirical content of that hypothesis quite differently. To that extent, though they may employ the same syntactic expressions, they use them to express empirically distinct hypotheses.
8. Call an object grue at a given time just in case either the time is earlier than the first second of the year 2030 and the object is green, or the time is not earlier than the first second of 2030 and the object is blue. Now the statement ‘All emeralds are green (at all times)’ has the same syntactic structure as ‘All emeralds are grue (at all times)’. So, if syntactic structure determines priors, then these two hypotheses should have the same prior probabilities. Indeed, both should have prior probabilities approaching 0. For, there are an infinite number of competitors of these two hypotheses, each sharing the same syntactic structure: consider the hypotheses ‘All emeralds are grue\(_n\) (at all times)’, where an object is grue\(_n\) at a given time just in case either the time is earlier than the first second of the nth day after January 1, 2030, and the object is green, or the time is not earlier than the first second of the nth day after January 1, 2030, and the object is blue. A purely syntactic specification of the priors should assign all of these hypotheses the same prior probability. But these are mutually exclusive hypotheses; so their prior probabilities must sum to a value no greater than 1. The only way this can happen is for ‘All emeralds are green’ and each of its grue\(_n\) competitors to have prior probability values either equal to 0 or infinitesimally close to it.
9. This assumption may be substantially relaxed without affecting the analysis below; we might instead only suppose that the ratios \(P_{\alpha}[c^n \pmid h_j\cdot b]/P_{\alpha}[c^n \pmid h_i\cdot b]\) are bounded so as not to get exceptionally far from 1. If that supposition were to fail, then the mere occurrence of the experimental conditions would count as very strong evidence for or against hypotheses—a highly implausible effect. Our analysis could include such bounded condition-ratios, but this would only add inessential complexity to our treatment.
10. For example, when a new disease is discovered, a new hypothesis \(h_{u+1}\) about that disease being a possible cause of patients’ symptoms is made explicit. The old catch-all was, “the symptoms are caused by some unknown disease—some disease other than \(h_1 ,\ldots ,h_u\)”. So the new catch-all hypothesis must now state that “the symptoms are caused by one of the remaining unknown diseases—some disease other than \(h_1 ,\ldots ,h_u, h_{u+1}\)”. And, clearly,
\[\begin{align}P_{\alpha}[h_K \pmid b] & = P_{\alpha}[{\nsim}h_1\cdot \ldots \cdot{\nsim}h_u \pmid b]\\& = P_{\alpha}[{\nsim}h_1\cdot \ldots \cdot{\nsim}h_u\cdot(h_{u+1}\vee{\nsim}h_{u+1}) \pmid b]\\& = P_{\alpha}[{\nsim}h_1\cdot \ldots \cdot{\nsim}h_{u}\cdot{\nsim}h_{u+1} \pmid b] + P_{\alpha}[h_{u+1} \pmid b]\\& = P_{\alpha}[h_{K*} \pmid b] + P_{\alpha}[h_{u+1} \pmid b].\end{align}\]Thus, the new hypothesis \(h_{u+1}\) is “peeled off” of the old catch-all hypothesis \(h_K\), leaving a new catch-all hypothesis \(h_{K*}\) with a prior probability value equal to that of the old catch-all minus the prior of the new hypothesis.
11. This claim depends, of course, on \(h_i\) being evidentially distinct from each alternative \(h_j\). I.e., there must be conditions \(c_k\) with possible outcomes \(o_{ku}\) on which the likelihoods differ:
\[P[o_{ku} \pmid h_{i}\cdot b\cdot c_{k}] \ne P[o_{ku} \pmid h_{j}\cdot b\cdot c_{k}].\]Otherwise \(h_i\) and \(h_j\) are empirically equivalent, and no amount of evidence can support one over the other. (Did you think a confirmation theory could possibly do better?—could somehow employ evidence to confirm the true hypothesis over evidentially equivalent rivals?) If the true hypothesis has evidentially equivalent rivals, then the convergence result implies that the odds against the disjunction of the true hypothesis with these rivals very probably go to 0, so the posterior probability of this disjunction goes to 1. Among evidentially equivalent hypotheses the ratio of their posterior probabilities equals the ratio of their priors:
\[\frac{P_{\alpha}[h_j \pmid b\cdot c^n\cdot e^n]}{P_{\alpha}[h_i \pmid b\cdot c^n\cdot e^n]} = \frac{P_{\alpha}[h_j \pmid b]}{P_{\alpha}[h_i \pmid b]}.\]So the true hypothesis will have a posterior probability near 1 (after evidence drives the posteriors of evidentially distinguishable rivals near to 0) just in case plausibility arguments and considerations (expressed in b) make each evidentially indistinguishable rival so much less plausible by comparison that the sum of their comparative plausibilities (as compared to the true hypothesis) remains very small.
One more comment about this. It is tempting to identify evidential distinguishability (via the evidential likelihoods) with empirical distinguishability. But many plausibility arguments in the sciences, such as thought experiments, draw on broadly empirical considerations, on what we know or strongly suspect about how the world works based on our experience of the world. Although this kind of “evidence” may not be representable via evidential likelihoods (because the hypotheses it bears on don’t deductively or probabilistically imply it), it often plays an important role in scientific assessments of hypotheses—in assessments of whether a hypothesis is so extraordinary that only really extraordinary likelihood evidence could rescue it. It is (arguably) a distinct virtue of the Bayesian logic of evidential support that it permits such considerations to be figured into the net support for hypotheses.
12. This is a good place to describe one reason for thinking that inductive support functions must be distinct from subjectivist or personalist degree-of-belief functions. Although likelihoods have a high degree of objectivity in many scientific contexts, it is difficult for belief functions to properly represent objective likelihoods. This is an aspect of the problem of old evidence.
Belief functions are supposed to provide an idealized model of belief strengths for agents. They extend the notion of ideally consistent belief to a probabilistic notion of ideally coherent belief strengths. There is no harm in this kind of idealization. It is supposed to supply a normative guide for real decision making. An agent is supposed to make decisions based on her belief-strengths about the state of the world, her belief strengths about possible consequences of actions, and her assessment of the desirability (or utility) of these consequences. But the very role that belief functions are supposed to play in decision making makes them ill-suited to inductive inferences where the likelihoods are often supposed to be objective, or at least possess inter-subjectively agreed values that represent the empirical import of hypotheses. For the purposes of decision making, degree-of-belief functions should represent the agent’s belief strengths based on everything she presently knows. So, degree-of-belief likelihoods must represent how strongly the agent would believe the evidence if the hypothesis were added to everything else she presently knows. However, support-function likelihoods are supposed to represent what the hypothesis (together with explicit background and experimental conditions) says or implies about the evidence. As a result, degree-of-belief likelihoods are saddled with a version of the problem of old evidence, a problem not shared by support-function likelihoods. Furthermore, it turns out that the old evidence problem for likelihoods is much worse than is usually recognized.
Here is the problem. If the agent is already certain of an evidence statement e, then her belief-function likelihoods for that statement must be 1 on every hypothesis. I.e., if \(Q_{\gamma}\) is her belief function and \(Q_{\gamma}[e] = 1\), then it follows from the axioms of probability theory that \(Q_{\gamma}[e \pmid h_i\cdot b\cdot c] = 1\), regardless of what \(h_i\) says—even if \(h_i\) implies that e is quite unlikely (given \(b\cdot c\)). But the problem goes even deeper. It not only applies to evidence that the agent knows with certainty. It turns out that almost anything the agent learns that can change how strongly she believes e will also influence the value of her belief-function likelihood for e, because \(Q_{\gamma}[e \pmid h_i\cdot b\cdot c]\) represents the agent’s belief strength given everything she knows.
To see the difficulty with less-than-certain evidence, consider the following example. Let e be any statement that is statistically implied to degree r by a hypothesis h together with experimental conditions c (e.g., e says “the coin lands heads on the next toss” and \(h\cdot c\) says “the coin is fair and is tossed in the usual way on the next toss”). Then the correct objective likelihood value is just \(P[e \pmid h\cdot c] = r\) (e.g., for \(r = 1/2\)). Let d be a statement that is intuitively not relevant in any way to how likely e should be on \(h\cdot c\) (e.g., let d say “Jim will be really pleased with the outcome of that next toss”). Suppose some rational agent has a degree-of-belief function Q for which the likelihood for e due to \(h\cdot c\) agrees with the objective value: \(Q[e \pmid h\cdot c] = r\) (e.g., with \(r = 1/2\)).
Our analysis will show that this agent’s belief-strength for d given \({\nsim}e\cdot h\cdot c\) will be a relevant factor; so suppose that her degree-of-belief in that regard has some value s less than 1: \(Q[d \pmid {\nsim}e\cdot h\cdot c] = s \lt 1\) (e.g., suppose \(s = 1/2\)). This is a very weak supposition. It only says that adding \({\nsim}e\cdot h\cdot c\) to everything else the agent currently knows leaves her less than certain that d is true.
Now, suppose this agent learns the following bit of new information in a completely convincing way (e.g., I seriously tell her so, and she believes me completely): \((d\vee e)\) (i.e., Jim will be really pleased with the outcome of the next toss unless it comes up heads).
Thus, on the usual Bayesian degree-of-belief account the agent is supposed to update her belief function Q to arrive at a new belief function \(Q_{\textit{new}}\) by the updating rule:
\(Q_{\textit{new}}[S] = Q[S \pmid (d\vee e)]\), for each statement S.
However, this update of the agent’s belief function has to screw up the objectivity of her new belief-function likelihood for e on \(h\cdot c\), because she now should have:
\[\begin{align}Q_{new}[e \pmid h\cdot c] & = \frac{Q_{new}[e\cdot h\cdot c]}{Q_{new}[h\cdot c]}\\[1ex]& = \frac{Q[e\cdot h\cdot c \pmid (d\vee e)]}{Q[h\cdot c \pmid (d\vee e)]}\\[1ex]& = \frac{Q[(d\vee e)\cdot(e\cdot h\cdot c)]}{Q[(d\vee e)\cdot(h\cdot c)]}\\[1ex]& = \frac{Q[(d\vee e)\cdot e \pmid h\cdot c]}{Q[(d\vee e) \pmid h\cdot c]}\\[1ex]& = \frac{Q[e \pmid h\cdot c]}{Q[((d\cdot{\nsim}e)\vee e) \pmid h\cdot c]}\\[1ex]& = \frac{Q[e \pmid h\cdot c]}{Q[e \pmid h\cdot c] + Q[d\cdot{\nsim}e \pmid h\cdot c]}\\[1ex]& = \frac{Q[e \pmid h\cdot c]}{Q[e \pmid h\cdot c] + Q[d \pmid {\nsim}e \cdot h\cdot c] \times Q[{\nsim}e \pmid h\cdot c]}\\[1ex]& = \frac{r}{r + s\times(1- r)}\\[1ex]& = \frac{1}{1 + s\times\frac{(1- r)}{r}}.\end{align}\]Thus, the updated belief function likelihood must have value
\[Q_{new}[e \pmid h\cdot c] = \frac{1}{1 + s\times \frac{(1- r)}{r}}.\]This factor can be equal to the correct likelihood value r just in case \(s = 1\). For example, for \(r = 1/2\) and \(s = 1/2\) we get \(Q_{new}[e \pmid h\cdot c] = 2/3\).
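A quick exact computation confirms these numbers (the formula is the one just derived; the function name is mine):

```python
from fractions import Fraction

def updated_likelihood(r, s):
    """The updated belief-function likelihood derived above:
    Q_new[e | h·c] = 1 / (1 + s*(1 - r)/r)."""
    return 1 / (1 + s * (1 - r) / r)

r = Fraction(1, 2)  # the objective likelihood P[e | h·c]
print(updated_likelihood(r, Fraction(1, 2)))  # 2/3: updating has distorted r
print(updated_likelihood(r, Fraction(1, 1)))  # 1/2: only s = 1 leaves r intact
```

Using exact rational arithmetic makes it transparent that the updated likelihood coincides with r only in the degenerate case \(s = 1\).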
The point is that even the most trivial knowledge of disjunctive claims involving e may completely upset the value of the likelihood for an agent’s belief function. And an agent will almost always have some such trivial knowledge. Updating on such disjunctions can force the agent’s belief functions to deviate widely from the evidentially relevant objective values of likelihoods on which scientific hypotheses should be tested.
More generally, it can be shown that the incorporation into a belief function Q of almost any kind of evidence for or against the truth of a prospective evidence claim e—even uncertain evidence for e, as may come through Jeffrey updating—completely undermines the objective or inter-subjectively agreed likelihoods that a belief function might have expressed before updating. This should be no surprise. The agent’s belief-function likelihoods reflect her total degree-of-belief in e, based on a hypothesis h together with everything else she knows about e. So the agent’s present belief function may capture appropriate public likelihoods for e only if e is completely isolated from the agent’s other beliefs. And this will rarely be the case.
One Bayesian subjectivist response to this kind of problem is that the belief functions employed in scientific inductive inferences should often be “counterfactual” belief functions, which represent what the agent would believe if e were subtracted (in some suitable way) from everything else she knows (see, e.g., Howson & Urbach 1993). However, our examples show that merely subtracting e won’t do. One must also subtract any disjunctive statements containing e. And it can be shown that one must subtract any uncertain evidence for or against e as well. So the counterfactual belief function idea needs a lot of working out if it is to rescue the idea that subjectivist Bayesian belief functions can provide a viable account of the likelihoods employed by the sciences in inductive inferences.
13. That is, for each inductive support function \(P_{\alpha}\), the posterior \(P_{\alpha}[h_j \pmid b\cdot c^n\cdot e^n]\) must go to 0 as the ratio
\[\frac{P_{\alpha}[h_j \pmid b\cdot c^n\cdot e^n]}{P_{\alpha}[h_i \pmid b\cdot c^n\cdot e^n]}\]goes to 0; and that must occur if the likelihood ratios
\[\frac{P[e^n \pmid h_{j}\cdot b\cdot c^{n}]}{P[e^n \pmid h_{i}\cdot b\cdot c^{n}]}\]approach 0, provided that the prior probability \(P_{\alpha}[h_i \pmid b]\) is greater than 0. The Likelihood Ratio Convergence Theorem will show that when \(h_i\cdot b\) is true, it is very likely that the evidence will indeed be such as to drive the likelihood ratios as near to 0 as you please, for a long enough (or strong enough) evidence stream. (If the stream is strong in that the likelihood ratios of individual bits of evidence are small, then to bring about a very small cumulative likelihood ratio, the evidence stream need not be as long.) As likelihood ratios head towards 0, the only way a Bayesian agent can avoid having her inductive support function(s) yield posterior probabilities for \(h_j\) that approach 0 (as n gets large) is to continually change her prior probability assessments. That means either continually finding and adding new plausibility arguments (i.e., adding to or modifying b) that on balance favor \(h_j\) over \(h_i\), or continually reassessing the support strength due to plausibility arguments already available, or both.
Technically, continual reassessment of support strengths that favor \(h_j\) over \(h_i\) based on already extant arguments (in b) means switching to new support functions (or new vagueness sets of them) that assign \(h_j\) ever higher prior probabilities as compared to \(h_i\) based on the same arguments in b. In any case, such revisions of argument strengths may avoid the convergence towards 0 of the posterior probability of \(h_j\) only if they proceed at a rate that keeps ahead of the rate at which the evidence drives the likelihood ratios towards 0.
For a thorough presentation of the most prominent Bayesian convergence results and a discussion of their weaknesses, see Earman 1992: Ch. 6. However, Earman does not discuss the convergence theorems under consideration here (because the convergence results discussed here first appeared in Hawthorne 1993, just after Earman’s book came out).
14. In scientific contexts, all of the most important kinds of cases where large components of the evidence fail to be result-independent of one another are cases where some part of the total evidence helps to tie down the numerical value of a parameter that plays an important role in the likelihood values the hypothesis specifies for other large parts of the total evidence. In cases where this only happens rather locally, where the evidence for a parameter value influences the likelihoods of only a very small part of the total evidence that bears on the hypothesis, we can treat the conjunction of the evidence for the parameter value with the evidential outcomes whose likelihoods the parameter value influences as a single chunk of evidence, which is then result-independent of the rest of the evidence (on each alternative hypothesis). This is the sort of chunking of the evidence into result-independent parts suggested in the main text.
However, in cases where the value of a parameter left unspecified by the hypothesis has a wide-ranging influence on many of the likelihood values the hypothesis specifies, another strategy for obtaining result-independence among these components of the evidence will do the job. A hypothesis that has an unspecified parameter value is in effect equivalent to a disjunction of more specific hypotheses, where each disjunct consists of a more precise version of the original hypothesis, a version in which the value for the parameter has been “filled in”. Relative to each of these more precise hypotheses, any evidence for or against the parameter value that hypothesis specifies is evidence for or against that more precise hypothesis itself. Furthermore, the evidence whose likelihood values depend on the parameter value (and because of that, failed to be result-independent of the parameter-value evidence relative to the original hypothesis) is result-independent of the parameter-value evidence relative to each of these more precise hypotheses—because each of the precise hypotheses already identifies precisely what (it claims) the value of the parameter is. Thus, wherever the workings of the logic of evidential support are made more perspicuous by treating evidence as composed of result-independent chunks, one may treat hypotheses whose unspecified parameter values interfere with result-independence as disjunctively composite hypotheses, apply the evidential logic to these more specific disjuncts, and thereby regain result-independence.
15. Technically, suppose that \(O_k\) can be further “subdivided” into more outcome-descriptions by replacing \(o_{kv}\) with two mutually exclusive parts, \(o_{kv}^*\) and \(o_{kv}^\#\), to produce the new outcome space
\[O_{k}^* = \{o_{k1},\ldots, o_{kv}^*, o_{kv}^\#,\ldots , o_{kw}\},\]where
\[P[o_{kv}^*\cdot o_{kv}^\# \pmid h_{i}\cdot b\cdot c_{k}] = 0\]and
\[P[o_{kv}^{*} \pmid h_{i}\cdot b\cdot c_{k}] + P[o_{kv}^\# \pmid h_{i}\cdot b\cdot c_{k}] = P[o_{kv} \pmid h_{i}\cdot b\cdot c_{k}];\]and suppose similar relationships hold for \(h_j\). Then the new EQI* (based on \(O_{k}^*\)) is greater than or equal to EQI (based on \(O_k\)); and \(\EQI^* \gt \EQI\) just in case at least one of the new likelihood ratios, e.g.,
\[\frac{P[o_{kv}^{*} \pmid h_{i}\cdot b\cdot c_{k}]}{P[o_{kv}^* \pmid h_{j}\cdot b\cdot c_{k}]},\]differs in value from the “undivided” outcome’slikelihood ratio,
\[\frac{P[o_{kv} \pmid h_{i}\cdot b\cdot c_{k}]}{P[o_{kv} \pmid h_{j}\cdot b\cdot c_{k}]}.\]The supplement on the effect on EQI of partitioning the outcome space proves this claim.
16. The likely rate of convergence will almost always be much faster than the worst-case bound provided by Theorem 2. To see the point more clearly, let’s look at a very simple example. Suppose \(h_i\) says that a certain bent coin has a propensity for “heads” of 2/3 and \(h_j\) says the propensity is 1/3. Let the evidence stream consist of outcomes of tosses. In this case the average EQI equals the EQI of each toss, which is 1/3; and the smallest possible likelihood ratio occurs for “heads”, which yields the value \(\gamma = \frac{1}{2}\). So, the value of the lower bound given by Theorem 2 for the likelihood of getting an outcome sequence with a likelihood ratio below \(\varepsilon\) (for \(h_j\) over \(h_i\)) is
\[1 - \frac{(1/n)(\log \tfrac{1}{2})^2}{((1/3) + (\log \varepsilon)/n)^2} = 1 - \frac{9}{n(1 + 3(\log \varepsilon)/n)^2}.\]Thus, according to the theorem, the likelihood of getting an outcome sequence with a likelihood ratio less than \(\varepsilon = 1/16\) (= .0625) when \(h_i\) is true and the number of tosses is \(n = 52\) is at least .70; and for \(n = 204\) tosses the likelihood is at least .95.
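These two bound values can be recomputed directly from the simplified formula. A sketch (logs are taken base 2, as the example's numbers require; the function name is mine):

```python
from math import log

def theorem2_lower_bound(n, gamma=0.5, eqi=1/3, eps=1/16):
    """Worst-case lower bound from Theorem 2 for this coin example:
    1 - (1/n) * (log2 gamma)^2 / (EQI + log2(eps)/n)^2."""
    return 1 - (1 / n) * log(gamma, 2) ** 2 / (eqi + log(eps, 2) / n) ** 2

print(round(theorem2_lower_bound(52), 2))   # about 0.71 (the note's "at least .70")
print(round(theorem2_lower_bound(204), 2))  # about 0.95
```

Plugging in other values of n shows how slowly this worst-case guarantee improves, which is exactly the contrast the next paragraph draws.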
To see the amount by which the lower bound provided by the theorem is in fact overly cautious, consider what the usual binomial distribution for the coin tosses in this example implies about the likely values of the likelihood ratios. The likelihood ratio for exactly k “heads” in n tosses is
\[\frac{(1/3)^k (2/3)^{n-k}}{(2/3)^k (1/3)^{n-k}} = 2^{n-2k};\] and we want this likelihood ratio to have a value less than \(\varepsilon\). A bit of algebraic manipulation shows that to get this likelihood ratio value to be below \(\varepsilon\), the proportion of “heads” needs to be \(k/n \gt \frac{1}{2} - \frac{1}{2}(\log \varepsilon)/n\). Using the normal approximation to the binomial distribution (with mean \(= 2/3\) and variance \(= (2/3)(1/3)/n\)) the actual likelihood of obtaining an outcome sequence having more than \(\frac{1}{2} - \frac{1}{2}(\log \varepsilon)/n\) “heads” (which we just saw corresponds to getting a likelihood ratio less than \(\varepsilon\), thus disfavoring the 1/3 propensity hypothesis as compared to the 2/3 propensity hypothesis by that much) when the true propensity for “heads” is 2/3 is given by the formula
\[\begin{align}\Phi\left[\frac{\left(\textrm{mean} - \left(\frac{1}{2} - \frac{1}{2}\frac{(\log \varepsilon)}{n}\right)\right)}{(\textrm{variance})^{\frac{1}{2}}}\right]= \Phi\left[(1/8)^{\frac{1}{2}}n^{\frac{1}{2}}\left(1 + \frac{3(\log \varepsilon)}{n}\right)\right]\end{align}\] (where \(\Phi[x]\) gives the value of the standard normal distribution from \(-\infty\) to x). Now let \(\varepsilon = 1/16\) (= .0625), as before. So the actual likelihood of obtaining a stream of outcomes with likelihood ratio this small when \(h_i\) is true and the number of tosses is \(n = 52\) is \(\Phi[1.96] \gt .975\), whereas the lower bound given by Theorem 2 was .70. And if the number of tosses is increased to \(n = 204\), the likelihood of obtaining an outcome sequence with a likelihood ratio this small (i.e., \(\varepsilon = 1/16\)) is \(\Phi[4.75] \gt .999999\), whereas the lower bound from Theorem 2 for this likelihood is .95. Indeed, to actually get a likelihood of .95 that the evidence stream will produce a likelihood ratio less than \(\varepsilon = .0625\), the number of tosses needed is only \(n = 43\), rather than the 204 tosses the bound given by the theorem requires in order to get up to the value .95. (Note: These examples employ “identically distributed” trials—repeated tosses of a coin—as an illustration. But Convergence Theorem 2 applies much more generally. It applies to any evidence sequence, no matter how diverse the probability distributions for the various experiments or observations in the sequence.)
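The normal-approximation values can be checked the same way; the following sketch (again assuming base-2 logarithms) computes the standard normal CDF via the error function:

```python
import math

def phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def actual_likelihood(n, epsilon):
    # Phi[ sqrt(n/8) * (1 + 3*(log epsilon)/n) ], logs base 2
    return phi(math.sqrt(n / 8) * (1 + 3 * math.log2(epsilon) / n))

print(round(actual_likelihood(52, 1/16), 4))   # ≈ 0.9751 (i.e., Phi[1.96])
print(round(actual_likelihood(204, 1/16), 6))  # ≈ 0.999999 (i.e., Phi[4.75])
print(round(actual_likelihood(43, 1/16), 3))   # ≈ 0.953 — already above .95
```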
17. It should now be clear why the boundedness of EQI above 0 is important. Convergence Theorem 2 applies only when
\[\bEQI[c^{n} \pmid h_i /h_j \pmid b] \gt \frac{-(\log \varepsilon)}{n}.\] But this requirement is not a strong assumption. For, the Nonnegativity of EQI Theorem shows that the empirical distinctness of two hypotheses on a single possible outcome suffices to make the average EQI positive for the whole sequence of experiments. So, given any small fraction \(\varepsilon \gt 0\), the value of \(-(\log \varepsilon)/n\) (which is always greater than 0 when \(\varepsilon \lt 1\)) will eventually become smaller than \(\bEQI\), provided that the degree to which the hypotheses are empirically distinct for the various observations \(c_k\) does not on average degrade too much as the length n of the evidence stream increases.
When the possible outcomes for the sequence of observations are independent and identically distributed, Theorems 1 and 2 effectively reduce to L. J. Savage’s Bayesian Convergence Theorem (Savage, pp. 52–54), although Savage’s theorem doesn’t supply explicit lower bounds on the probability that the likelihood ratio will be small. Independent, identically distributed outcomes most commonly result from the repetition of identical statistical experiments (e.g., repeated tosses of a coin, or repeated measurements of quantum systems prepared in identical states). In such experiments a hypothesis will specify the same likelihoods for the same kinds of outcomes from one observation to the next. So \(\bEQI\) will remain constant as the number of experiments, n, increases. However, Theorems 1 and 2 are much more general. They continue to hold when the sequence of observations encompasses completely unrelated experiments that have different distributions on outcomes—experiments that have nothing in common but their connection to the hypotheses they test.
18. In many scientific contexts this is the best we can hope for. But it still provides a very reasonable representation of inductive support. Consider, for example, the hypothesis that the land masses of Africa and South America separated and drifted apart over the eons, the drift hypothesis, as opposed to the hypothesis that the continents have fixed positions acquired when the earth first formed and cooled and contracted, the contraction hypothesis. One may not be able to determine anything like precise likelihoods, on each hypothesis, for the evidence that: (1) the shape of the east coast of South America matches the shape of the west coast of Africa as closely as it in fact does; (2) the geology of the two coasts matches up so closely when they are “fitted together” in the obvious way; (3) the plant and animal species on these distant continents should be as similar as they are, as compared to how similar species are among other distant continents. Although neither the drift hypothesis nor the contraction hypothesis supplies anything like precise likelihoods for these evidential claims, experts readily agree that each of these observations is much more likely on the drift hypothesis than on the contraction hypothesis. That is, the likelihood ratio for this evidence on the contraction hypothesis as compared to the drift hypothesis is very small. Thus, jointly these observations constitute very strong evidence for drift over contraction.
Historically, the case of continental drift is more complicated. Geologists tended to largely dismiss this evidence until the 1960s. This was not because the evidence wasn’t strong in its own right. Rather, this evidence was found unconvincing because it was not sufficient to overcome prior plausibility considerations that made the drift hypothesis extremely implausible—much less plausible than the contraction hypothesis. The problem was that there seemed to be no plausible mechanism by which drift might occur. It was argued, quite plausibly, that no known force could push or pull the continents apart, and that the less dense continental material could not push through the denser material that makes up the ocean floor. These plausibility objections were overcome when a plausible mechanism was articulated—i.e., the continental crust floats atop molten material and moves apart as convection currents in the molten material carry it along. The case was pretty well clinched when evidence for this mechanism was found in the form of “spreading zones” containing alternating strips of magnetized material at regular distances from mid-ocean ridges. The magnetic alignments of materials in these strips correspond closely to the magnetic alignments found in magnetic materials in dateable sedimentary layers at other locations on the earth. These magnetic alignments indicate time periods when the direction of earth’s magnetic field has reversed. And this gave geologists a way of measuring the rate at which the sea floor might spread and the continents move apart. Although geologists may not be able to determine anything like precise values for the likelihoods of any of this evidence on each of the alternative hypotheses, the evidence is universally agreed to be much more likely on the drift hypothesis than on the alternative contraction hypothesis. The likelihood ratio for this evidence on the contraction hypothesis as compared to the drift hypothesis is somewhat vague, but extremely small.
The vagueness is only with regard to how extremely small the likelihood ratio is. Furthermore, with the emergence of a plausible mechanism, the drift hypothesis is no longer so overwhelmingly implausible (logically) prior to taking the likelihood evidence into account. Thus, even when precise values for individual likelihoods are not available, the value of a likelihood ratio range may be objective enough to strongly refute one hypothesis as compared to another. Indeed, the drift hypothesis is itself strongly supported by the evidence; for, no alternative hypothesis that has the slightest amount of comparative plausibility can account for the available evidence nearly so well. That is, no plausible alternative makes the evidence anywhere near so likely. Given the currently available evidence, the only issues left open (for now) involve comparing various alternative versions of the drift hypothesis (involving differences of detail) against one another.
19. To see the point of the third clause, suppose it were violated. Thatis, suppose there are possible outcomes for which the likelihood ratiois very near 1 for just one of the two support functions. Then, even avery long sequence of such outcomes might leave the likelihood ratiofor one support function almost equal to 1, while the likelihood ratiofor the other support function goes to an extreme value. If that canhappen for support functions in a class that represent likelihoods forvarious scientists in the community, then the empirical contents ofthe hypotheses are either too vague or too much in dispute formeaningful empirical evaluation to occur.
20. If there are a few directionally controversial likelihood ratios, where \(P_{\alpha}\) says the ratio is somewhat greater than 1, while \(P_{\beta}\) assigns a value somewhat less than 1, these may not greatly affect the trend of \(P_{\alpha}\) and \(P_{\beta}\) towards agreement on the refutation and support of hypotheses provided that the controversial ratios are not so extreme as to overwhelm the stream of other evidence on which the likelihood ratios do directionally agree. Even so, researchers will want to get straight on what each hypothesis says or implies about such cases. While that remains in dispute, the empirical content of each hypothesis remains unsettlingly vague.
21. What it means for a sample to be randomly selected from a population is philosophically controversial. Various analyses of the concept have been proposed, and disputed. For our purposes an account of the following sort will suffice. To say
S is a random sample of population B with respect to attribute A
means that
the selection set S is generated by a process that has an objective chance (or propensity) r of choosing individual objects that have attribute A from among the objects in population B, where on each selection the chance value r agrees with the value of the frequency of As among the Bs, \(F[A,B]\).
Defined this way, randomness implies probabilistic independence among the outcomes of selections with regard to whether they exhibit attribute A, on any given hypothesis about the true value of the frequency r of As among the Bs.
The tricky part of generating a randomly selected set from the population is to find a selection process for which the chance of selecting an A each time matches the true frequency without already knowing what the true frequency value is—i.e., without already knowing what the value of r is. However, there clearly are ways to do this. Here is one way:
the sample S is generated by a process that on each selection gives each member of B an equal chance of being selected into S (like drawing balls from a well-shaken urn).
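This urn-style process is easy to simulate. A minimal sketch, using a hypothetical population in which the true frequency of A is .3 (both the population size and the frequency are illustrative assumptions):

```python
import random

random.seed(0)

# Hypothetical population B of 1,000 objects; 300 have attribute A,
# so the true frequency F[A,B] is 0.3.
B = [True] * 300 + [False] * 700

def random_sample(population, n):
    # Each selection gives every member an equal chance (with replacement),
    # like drawing from a well-shaken urn and returning the ball each time.
    return [random.choice(population) for _ in range(n)]

S = random_sample(B, 10_000)
sample_freq = sum(S) / len(S)
print(sample_freq)  # close to the true frequency 0.3
```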
Here, schematically, is another way:
find a subclass of B, call it C, from which S can be generated by a process that gives every member of C an equal chance of being selected into S, where C is representative of B with respect to A in the sense that the frequency of A in C is almost precisely the same as the frequency of A in B.
Pollsters use a process of this kind. Ideally a poll of registered voters, population B, should select a sample S in a way that gives every registered voter the same chance of getting selected into S. But that may be impractical. However, it suffices if the sample is selected from a representative subpopulation C of B—e.g., from registered voters who answered the telephone between the hours of 7 PM and 9 PM in the middle of the week. Of course, the claim that a given subpopulation C is representative is itself a hypothesis that is open to inductive support by evidence. Professional polling organizations do a lot of research to calibrate their sampling technique, to find out what sort of subpopulations C they may draw on as highly representative. For example, one way to see if registered voters who answer the phone during the evening, mid-week, are likely to constitute a representative sample is to conduct a large poll of such voters immediately after an election, when the result is known, to see how representative of the actual vote count the count from the subpopulation turns out to be.
Notice that although the selection set S is selected from B, S cannot be a subset of B, not if S can be generated by sampling with replacement. For, a specific member of B may be randomly selected into S more than once. If S were a subset of B, any specific member of B could only occur once in S. That is, consider the case where S consists of n selections from B, but where the process happens to select the same member b of B twice. Then, were S a subset of B, although b is selected into S twice, S can only possess b as a member once, so S has at most \(n-1\) members after all (even fewer if other members of B are selected more than once). So, rather than being members of B, the members of S must be representations of members of B, like names, where the same member of B may be represented by different names. However, the representations (or names) in S technically may not be the sorts of things that can possess attribute A. So, technically, on this way of handling the problem, when we say that a member of S exhibits A, this is shorthand for the claim that its referent in B possesses attribute A.
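The point that S is a sequence of selections (of “names”), not a subset of B, can be made concrete in a short sketch: when the number of selections exceeds the size of B, the pigeonhole principle guarantees that some member is selected more than once, so collapsing S to a set would lose selections:

```python
import random

random.seed(0)

# Population B has 10 members, referred to by index ("name") 0..9.
# S records 20 selections with replacement, so S is a sequence of names,
# not a subset of B: with 20 selections from 10 members, some member
# must appear in S more than once.
S = [random.randrange(10) for _ in range(20)]

print(len(S))       # 20 selections
print(len(set(S)))  # at most 10 distinct members survive as a set
```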
22. This is closely analogous to the Stable-Estimation Theorem of Edwards, Lindman, & Savage (1963). Here is a proof of Case 1, i.e., where the number of members of the reference class B is finite and where for some integer u at least as large as the size of B there is a specific (perhaps very large) integer K such that the prior probability of a hypothesis stating a frequency outside region R is never more than K times as large as that of a hypothesis stating a frequency within region R. (The proof of Case 2 is almost exactly the same, but draws on integrals wherever the present proof draws on sums using the ‘\(\sum\)’ expression.)
A few observations before proceeding to the main derivation:
Define L to be the smallest value of a prior probability \(P_{\alpha}[F[A,B]=r \pmid b]\) for r a fraction in R. Notice that \(L \gt 0\) because, by supposition, finite
\[K \ge \frac{P_{\alpha}[F[A,B]=s \pmid b] }{ P_{\alpha}[F[A,B]=r \pmid b]}\] for the largest value of \(P_{\alpha}[F[A,B]=s \pmid b]\) for which s is outside of R and the smallest value of \(P_{\alpha}[F[A,B]=r \pmid b]\) for which r is within region R.
and

\[\sum_{r\in R} r^m \times(1-r)^{n-m} \times(P_{\alpha}[F[A,B]=r \pmid b] / L) \ge \sum_{r\in R} r^m \times(1-r)^{n-m};\]

also notice that the ratio

\[\frac{\sum_{r\in R} r^m \times(1-r)^{n-m}}{\sum_{s\in \{s \pmid s=k/u\}} s^m \times(1-s)^{n-m}}\]

is extremely close to the value of \(\beta[R, m+1, n-m+1]\).
We now proceed to the main part of the derivation.
From the Odds Form of Bayes’ Theorem (Equation 10) we have,
\( \begin{align} &\Omega_{\alpha}\left[\begin{aligned}&F[A,B]\not\in R \\ & \begin{split}{} \pmid F[A,S] =m/n &\cdot \Rnd[S,B,A] \\ & \cdot \Size[S] =n \\ & \cdot b\end{split} \end{aligned}\right] \\[1ex]\end{align}\)
\(\begin{align}& =\frac{\sum_{s\not\in R} P_{\alpha}\left[\begin{aligned}& F[A,B]=s \\&\begin{split}{} \pmid F[A,S] =m/n & \cdot \Rnd[S,B,A] \\& \cdot \Size[S] =n \\& \cdot b\end{split}\end{aligned}\right]}{\sum_{r\in R} P_{\alpha}\left[\begin{aligned}& F[A,B] =r \\&\begin{split}{} \pmid F[A,S] =m/n & \cdot \Rnd[S,B,A] \\& \cdot \Size[S]=n \\&\cdot b\end{split}\end{aligned}\right]}\\[1ex]\end{align}\)
\(\begin{align}& =\frac{\sum_{s\not\in R} P\left[\begin{aligned}& F[A,S]=m/n \\&\begin{split}{} \pmid F[A,B] =s & \cdot \Rnd[S,B,A] \\& \cdot \Size[S]=n \\& \cdot b\end{split}\end{aligned}\right] \times P_{\alpha}[F[A,B] =s \pmid b]}{\sum_{r\in R} P\left[\begin{aligned}& F[A,S] =m/n \\& \begin{split}{} \pmid F[A,B] =r & \cdot \Rnd[S,B,A] \\& \cdot \Size[S] =n \\& \cdot b\end{split}\end{aligned}\right] \times P_{\alpha}[F[A,B] =r \pmid b]}\\[1ex]\end{align}\)
\(\begin{align}& =\frac{\sum_{s\not\in R} s^m \times(1-s)^{n-m} \times P_{\alpha}[F[A,B]=s \pmid b]}{\sum_{r\in R} r^m\times(1-r)^{n-m} \times P_{\alpha}[F[A,B]=r \pmid b]}\\[1ex]\end{align}\)
\(\begin{align}& =\frac{\sum_{s\not\in R} s^m\times(1-s)^{n-m} \times(P_{\alpha}[F[A,B]=s \pmid b] / L)}{\sum_{r\in R} r^m\times(1-r)^{n-m} \times(P_{\alpha}[F[A,B]=r \pmid b] / L)}\\[1ex]\end{align}\)
\(\begin{align}& \le\frac{\sum_{s\not\in R} s^m\times(1-s)^{n-m} \times K}{\sum_{r\in R} r^m\times(1-r)^{n-m}}\\[1ex]\end{align}\)
\(\begin{align}&= K \times\frac{\sum_{s\in \{s \pmid s=k/u\}} s^m\times(1-s)^{n-m} - \sum_{r\in R} r^m\times(1-r)^{n-m}}{\sum_{r\in R} r^m\times(1-r)^{n-m}}\\[1ex]\end{align}\)
\(\begin{align}&= K \times\left[\frac{\sum_{s\in \{s \pmid s=k/u\}} s^m\times(1-s)^{n-m}}{\sum_{r\in R} r^m\times(1-r)^{n-m}} - 1\right]\\[1ex]\end{align}\)
\(\begin{align}&\approx K\times[(1/\beta[R, m+1, n-m+1]) - 1]. \end{align} \)
Thus,
\[\begin{multline}\Omega_{\alpha}\left[\begin{aligned}& F[A,B]\not\in R \\&\begin{split}\pmid F[A,S] =m/n & \cdot \Rnd[S,B,A] \\& \cdot \Size[S]=n \\& \cdot b\end{split}\end{aligned}\right]\\[1ex]\le K\times\left[\left(\frac{1}{\beta[R, m+1, n-m+1]}\right) - 1\right]. \end{multline}\] Then by equation (11), which expresses the relationship between posterior probability and posterior odds against,
\[\begin{multline}P_{\alpha}\left[\begin{aligned}&F[A,B]\in R \\& \begin{split}{} \pmid F[A,S] =m/n & \cdot \Rnd[S,B,A] \\& \cdot \Size[S] =n \\& \cdot b\end{split}\end{aligned}\right]\\= 1 / \left(1 + \Omega_{\alpha}\left[\begin{aligned}&F[A,B]\not\in R \\&\begin{split}{}\pmid F[A,S] =m/n & \cdot \Rnd[S,B,A] \\& \cdot \Size[S] =n \\& \cdot b\end{split}\end{aligned}\right]\right)\\\ge \frac{1}{\left(1 + K\times\left[\left(\frac{1}{\beta[R, m+1, n-m+1]}\right) - 1\right]\right)}. \end{multline}\]

23. To get a better idea of the import of this theorem, let’s consider some specific values. First notice that the factor \(r\times(1-r)\) can never be larger than \((1/2)\times(1/2) = 1/4\); and the closer r is to 1 or 0, the smaller \(r\times(1-r)\) becomes. So, whatever the value of r, the factor \(q/((r\times(1-r))/n)^{\frac{1}{2}} \ge 2\times q\times n^{\frac{1}{2}}\). Thus, for any chosen value of q,
\[\begin{multline}P\left[\begin{aligned}& r-q \lt F[A,S] \lt r+q \\& \begin{split}{} \pmid F[A,B] = r & \cdot\Rnd[S,B,A]\\& \cdot\Size[S] = n\end{split}\end{aligned}\right]\ge 1 - 2\times \Phi[-2\times q\times n^{\frac{1}{2}}].\end{multline}\] For example, if \(q =\) .05 and \(n = 400\), then we have (for any value of r),
\[\begin{multline}P\left[\begin{aligned}& r-.05 \lt F[A,S] \lt r+.05\\&\begin{split}{} \pmid F[A,B] = r & \cdot\Rnd[S,B,A]\\& \cdot\Size[S] = 400\end{split}\end{aligned}\right]\ge .95.\end{multline}\] For \(n = 900\) (and margin \(q =\) .05) this lower bound rises to .997:
\[\begin{multline}P\left[\begin{aligned}& r-.05 \lt F[A,S] \lt r+.05 \\&\begin{split}{} \pmid F[A,B] = r & \cdot\Rnd[S,B,A]\\& \cdot\Size[S] = 900\end{split}\end{aligned}\right]\ge .997.\end{multline}\] If we are interested in a smaller margin of error q, we can keep the same sample size and find the value of the lower bound for that value of q. For example,
\[\begin{multline}P\left[\begin{aligned}& r-.03 \lt F[A,S] \lt r+.03 \\& \begin{split}{} \pmid F[A,B] = r & \cdot\Rnd[S,B,A]\\& \cdot\Size[S] = 900\end{split}\end{aligned}\right]\ge .928.\end{multline}\] By increasing the sample size the bound on the likelihood can be made as close to 1 as we want, for any margin q we choose. For example:
\[\begin{multline}P\left[\begin{aligned}& r-.01 \lt F[A,S] \lt r+.01 \\& \begin{split}{} \pmid F[A,B] = r & \cdot \Rnd[S,B,A]\\& \cdot \Size[S] = 38000\end{split}\end{aligned}\right] \ge .9999.\end{multline}\] As the sample size n becomes larger, it becomes extremely likely that the sample frequency will fall within any specified margin of the true frequency r, as small a margin as you wish.
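All of the lower bounds quoted in this note come from the single formula \(1 - 2\times\Phi[-2\times q\times n^{\frac{1}{2}}]\), which can be tabulated for the margins and sample sizes above:

```python
import math

def phi(x):
    # Standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def freq_lower_bound(q, n):
    # P[ r-q < F[A,S] < r+q | ... ] >= 1 - 2*Phi[-2*q*sqrt(n)], for any r
    return 1 - 2 * phi(-2 * q * math.sqrt(n))

# Yields roughly 0.9545, 0.9973, 0.9281, 0.9999 for the four cases
for q, n in [(0.05, 400), (0.05, 900), (0.03, 900), (0.01, 38_000)]:
    print(q, n, round(freq_lower_bound(q, n), 4))
```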
24. To see the point more clearly, consider an example. To keep things simple, let’s suppose our background b says that the chance of heads for tosses of this coin is some whole percentage between 0% and 100%. Let c say that the coin is tossed in the usual random way; let e say that the coin comes up heads; and for each r that is a whole fraction of 100 between 0 and 1, let \(h_{[r]}\) be the simple statistical hypothesis asserting that the chance of heads on each random toss of this coin is r. Now consider the composite statistical hypothesis \(h_{[\gt .65]}\), which asserts that the chance of heads on each random (independent) toss is greater than .65. From the axioms of probability we derive the following relationship:
\[\begin{align}P_{\alpha}[e \pmid h_{[\gt .65]}\cdot c\cdot b] = & P[e \pmid h_{[.66]}\cdot c\cdot b] \times P_{\alpha}[h_{[.66]} \pmid h_{[\gt .65]}\cdot c\cdot b] \\& + P[e \pmid h_{[.67]}\cdot c\cdot b] \times P_{\alpha}[h_{[.67]} \pmid h_{[\gt .65]}\cdot c\cdot b] \\& + \ldots \\& + P[e \pmid h_{[1]}\cdot c\cdot b] \times P_{\alpha}[h_{[1]} \pmid h_{[\gt .65]}\cdot c\cdot b].\end{align}\] The issue for the likelihoodist is that the values of the terms of form \(P_{\alpha}[h_{[r]} \pmid h_{[\gt .65]}\cdot c\cdot b]\) are not objectively specified by the composite hypothesis \(h_{[\gt .65]}\) (together with \(c\cdot b\)), but the value of the likelihood \(P_{\alpha}[e \pmid h_{[\gt .65]}\cdot c\cdot b]\) depends essentially on these non-objective factors. So, likelihoods based on composite statistical hypotheses fail to possess the kind of objectivity that likelihoodists require.
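The dependence on the non-objective terms can be illustrated numerically. Here is a sketch for a single toss, where \(P[e \pmid h_{[r]}\cdot c\cdot b] = r\); the two weightings are hypothetical stand-ins for different support functions \(P_{\alpha}\), and nothing in the composite hypothesis itself selects between them:

```python
# Simple hypotheses h_[.66], ..., h_[1]; for a single toss coming up
# heads, P[e | h_[r]·c·b] = r.
rs = [r / 100 for r in range(66, 101)]

# Two hypothetical weightings P_alpha[h_[r] | h_[>.65]·c·b] -- the
# composite hypothesis (with c·b) does not fix these values.
uniform = {r: 1 / len(rs) for r in rs}
skewed_raw = {r: r - 0.65 for r in rs}
total = sum(skewed_raw.values())
skewed = {r: w / total for r, w in skewed_raw.items()}

def composite_likelihood(weights):
    # P_alpha[e | h_[>.65]·c·b] = sum over r of P[e | h_[r]·c·b] * weight(r)
    return sum(r * w for r, w in weights.items())

print(round(composite_likelihood(uniform), 3))  # 0.83
print(round(composite_likelihood(skewed), 3))   # ≈ 0.887: a different value
```

Since the two support functions disagree about the “likelihood” of the very same evidence on the very same composite hypothesis, that likelihood is not objective in the way likelihoods on simple statistical hypotheses are.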
25. The Law of Likelihood and the Likelihood Principle have been formulated in slightly different ways by various logicians and statisticians. The Law of Likelihood was first identified by that name in Hacking 1965, and has been invoked more recently by the likelihoodist statisticians A.F.W. Edwards (1972) and R. Royall (1997). R.A. Fisher (1922) argued for the Likelihood Principle early in the 20th century, although he didn’t call it that. One of the first places it is discussed under that name is Savage et al. 1962.