Likelihood function

From Wikipedia, the free encyclopedia

A likelihood function (often simply called the likelihood) measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the joint probability distribution of the random variable that (presumably) generated the observations.[1][2][3] When evaluated on the actual data points, it becomes a function solely of the model parameters.

In maximum likelihood estimation, the argument that maximizes the likelihood function serves as a point estimate for the unknown parameter, while the Fisher information (often approximated by the likelihood's Hessian matrix at the maximum) gives an indication of the estimate's precision.

In contrast, in Bayesian statistics, the estimate of interest is the converse of the likelihood, the so-called posterior probability of the parameter given the observed data, which is calculated via Bayes' rule.[4]

Definition


The likelihood function, parameterized by a (possibly multivariate) parameter $\theta$, is usually defined differently for discrete and continuous probability distributions (a more general definition is discussed below). Given a probability density or mass function

$$x \mapsto f(x \mid \theta),$$

where $x$ is a realization of the random variable $X$, the likelihood function is $\theta \mapsto f(x \mid \theta)$, often written $\mathcal{L}(\theta \mid x)$.

In other words, when $f(x \mid \theta)$ is viewed as a function of $x$ with $\theta$ fixed, it is a probability density function, and when viewed as a function of $\theta$ with $x$ fixed, it is a likelihood function. In the frequentist paradigm, the notation $f(x \mid \theta)$ is often avoided and instead $f(x; \theta)$ or $f(x, \theta)$ are used to indicate that $\theta$ is regarded as a fixed unknown quantity rather than as a random variable being conditioned on.

The likelihood function does not specify the probability that $\theta$ is the truth, given the observed sample $X = x$. Such an interpretation is a common error, with potentially disastrous consequences (see prosecutor's fallacy).

Discrete probability distribution


Let $X$ be a discrete random variable with probability mass function $p$ depending on a parameter $\theta$. Then the function

$$\mathcal{L}(\theta \mid x) = p_{\theta}(x) = P_{\theta}(X = x),$$

considered as a function of $\theta$, is the likelihood function, given the outcome $x$ of the random variable $X$. Sometimes the probability of "the value $x$ of $X$ for the parameter value $\theta$" is written as $P(X = x \mid \theta)$ or $P(X = x; \theta)$. The likelihood is the probability that a particular outcome $x$ is observed when the true value of the parameter is $\theta$, equivalent to the probability mass on $x$; it is not a probability density over the parameter $\theta$. The likelihood, $\mathcal{L}(\theta \mid x)$, should not be confused with $P(\theta \mid x)$, which is the posterior probability of $\theta$ given the data $x$.

Example

Figure 1.  The likelihood function ($p_{\text{H}}^2$) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HH.
Figure 2.  The likelihood function ($p_{\text{H}}^2(1 - p_{\text{H}})$) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HHT.

Consider a simple statistical model of a coin flip: a single parameter $p_{\text{H}}$ that expresses the "fairness" of the coin. The parameter is the probability that a coin lands heads up ("H") when tossed. $p_{\text{H}}$ can take on any value within the range 0.0 to 1.0. For a perfectly fair coin, $p_{\text{H}} = 0.5$.

Imagine flipping a fair coin twice, and observing two heads in two tosses ("HH"). Assuming that each successive coin flip is i.i.d., then the probability of observing HH is

$$P(\text{HH} \mid p_{\text{H}} = 0.5) = 0.5^2 = 0.25.$$

Equivalently, the likelihood of observing "HH" assuming $p_{\text{H}} = 0.5$ is

$$\mathcal{L}(p_{\text{H}} = 0.5 \mid \text{HH}) = 0.25.$$

This is not the same as saying that $P(p_{\text{H}} = 0.5 \mid \text{HH}) = 0.25$, a conclusion which could only be reached via Bayes' theorem given knowledge about the marginal probabilities $P(p_{\text{H}} = 0.5)$ and $P(\text{HH})$.

Now suppose that the coin is not a fair coin, but instead that $p_{\text{H}} = 0.3$. Then the probability of two heads on two flips is

$$P(\text{HH} \mid p_{\text{H}} = 0.3) = 0.3^2 = 0.09.$$

Hence

$$\mathcal{L}(p_{\text{H}} = 0.3 \mid \text{HH}) = 0.09.$$

More generally, for each value of $p_{\text{H}}$, we can calculate the corresponding likelihood. The result of such calculations is displayed in Figure 1. The integral of $\mathcal{L}$ over [0, 1] is 1/3; likelihoods need not integrate or sum to one over the parameter space.
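These calculations can be reproduced with a short numerical sketch; the helper function and grid below are illustrative choices, not part of any standard library.

```python
# Likelihood of the coin-flip model, L(p_H | data) = p_H**heads * (1 - p_H)**tails,
# evaluated for the observation HH and over a grid of candidate parameter values.
def coin_likelihood(p_H, heads, tails):
    return p_H**heads * (1 - p_H)**tails

print(coin_likelihood(0.5, 2, 0))   # 0.25, as computed above
print(coin_likelihood(0.3, 2, 0))   # 0.09 (up to floating-point rounding)

grid = [i / 100 for i in range(101)]               # p_H from 0.0 to 1.0
curve = [coin_likelihood(p, 2, 0) for p in grid]   # the curve shown in Figure 1
# A trapezoidal sum of `curve` over [0, 1] is about 1/3: the likelihood does not
# integrate to 1 over the parameter space.
```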

Continuous probability distribution


Let $X$ be a random variable following an absolutely continuous probability distribution with density function $f$ (a function of $x$) which depends on a parameter $\theta$. Then the function

$$\mathcal{L}(\theta \mid x) = f_{\theta}(x),$$

considered as a function of $\theta$, is the likelihood function (of $\theta$, given the outcome $X = x$). Again, $\mathcal{L}$ is not a probability density or mass function over $\theta$, despite being a function of $\theta$ given the observation $X = x$.
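For instance, with a normal density of known unit variance, the likelihood of the location parameter given a single observation can be traced out numerically. The sketch below assumes that model; the observed value and grid are made up for illustration.

```python
import math

def normal_density(x, mu, sigma=1.0):
    """Density f(x | mu, sigma) of the normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

x_obs = 1.7                                    # a single observed value
mus = [i / 10 for i in range(-20, 41)]         # candidate values of the parameter mu
likelihood = [normal_density(x_obs, mu) for mu in mus]
# The curve peaks at mu = x_obs; its values are densities in x, not probabilities of mu.
```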

Relationship between the likelihood and probability density functions


The use of the probability density in specifying the likelihood function above is justified as follows. Given an observation $x_j$, the likelihood for the interval $[x_j, x_j + h]$, where $h > 0$ is a constant, is given by $\mathcal{L}(\theta \mid x \in [x_j, x_j + h])$. Observe that

$$\operatorname*{arg\,max}_{\theta} \mathcal{L}(\theta \mid x \in [x_j, x_j + h]) = \operatorname*{arg\,max}_{\theta} \frac{1}{h} \mathcal{L}(\theta \mid x \in [x_j, x_j + h]),$$

since $h$ is positive and constant. Because

$$\operatorname*{arg\,max}_{\theta} \frac{1}{h} \mathcal{L}(\theta \mid x \in [x_j, x_j + h]) = \operatorname*{arg\,max}_{\theta} \frac{1}{h} \Pr(x_j \leq x \leq x_j + h \mid \theta) = \operatorname*{arg\,max}_{\theta} \frac{1}{h} \int_{x_j}^{x_j + h} f(x \mid \theta)\, dx,$$

where $f(x \mid \theta)$ is the probability density function, it follows that

$$\operatorname*{arg\,max}_{\theta} \mathcal{L}(\theta \mid x \in [x_j, x_j + h]) = \operatorname*{arg\,max}_{\theta} \frac{1}{h} \int_{x_j}^{x_j + h} f(x \mid \theta)\, dx.$$

The first fundamental theorem of calculus provides that

$$\lim_{h \to 0^{+}} \frac{1}{h} \int_{x_j}^{x_j + h} f(x \mid \theta)\, dx = f(x_j \mid \theta).$$

Then

$$\begin{aligned} \operatorname*{arg\,max}_{\theta} \mathcal{L}(\theta \mid x_j) &= \operatorname*{arg\,max}_{\theta} \left[ \lim_{h \to 0^{+}} \mathcal{L}(\theta \mid x \in [x_j, x_j + h]) \right] \\ &= \operatorname*{arg\,max}_{\theta} \left[ \lim_{h \to 0^{+}} \frac{1}{h} \int_{x_j}^{x_j + h} f(x \mid \theta)\, dx \right] \\ &= \operatorname*{arg\,max}_{\theta} f(x_j \mid \theta). \end{aligned}$$

Therefore,

$$\operatorname*{arg\,max}_{\theta} \mathcal{L}(\theta \mid x_j) = \operatorname*{arg\,max}_{\theta} f(x_j \mid \theta),$$

and so maximizing the probability density at $x_j$ amounts to maximizing the likelihood of the specific observation $x_j$.

In general


In measure-theoretic probability theory, the density function is defined as the Radon–Nikodym derivative of the probability distribution relative to a common dominating measure.[5] The likelihood function is this density interpreted as a function of the parameter, rather than the random variable.[6] Thus, we can construct a likelihood function for any distribution, whether discrete, continuous, a mixture, or otherwise. (Likelihoods are comparable, e.g. for parameter estimation, only if they are Radon–Nikodym derivatives with respect to the same dominating measure.)

The above discussion of the likelihood for discrete random variables uses the counting measure, under which the probability density at any outcome equals the probability of that outcome.

Likelihoods for mixed continuous–discrete distributions


The above can be extended in a simple way to allow consideration of distributions which contain both discrete and continuous components. Suppose that the distribution consists of a number of discrete probability masses $p_k(\theta)$ and a density $f(x \mid \theta)$, where the sum of all the $p$'s added to the integral of $f$ is always one. Assuming that it is possible to distinguish an observation corresponding to one of the discrete probability masses from one which corresponds to the density component, the likelihood function for an observation from the continuous component can be dealt with in the manner shown above. For an observation from the discrete component, the likelihood function is simply

$$\mathcal{L}(\theta \mid x) = p_k(\theta),$$

where $k$ is the index of the discrete probability mass corresponding to observation $x$, because maximizing the probability mass (or probability) at $x$ amounts to maximizing the likelihood of the specific observation.

The fact that the likelihood function can be defined in a way that includes contributions that are not commensurate (the density and the probability mass) arises from the way in which the likelihood function is defined up to a constant of proportionality, where this "constant" can change with the observation $x$, but not with the parameter $\theta$.

Regularity conditions


In the context of parameter estimation, the likelihood function is usually assumed to obey certain conditions, known as regularity conditions. These conditions are assumed in various proofs involving likelihood functions, and need to be verified in each particular application. For maximum likelihood estimation, the existence of a global maximum of the likelihood function is of the utmost importance. By the extreme value theorem, it suffices that the likelihood function is continuous on a compact parameter space for the maximum likelihood estimator to exist.[7] While the continuity assumption is usually met, the compactness assumption about the parameter space is often not, as the bounds of the true parameter values might be unknown. In that case, concavity of the likelihood function plays a key role.

More specifically, if the likelihood function is twice continuously differentiable on the $k$-dimensional parameter space $\Theta$, assumed to be an open connected subset of $\mathbb{R}^k$, there exists a unique maximum $\hat{\theta} \in \Theta$ if the matrix of second partials

$$\mathbf{H}(\theta) \equiv \left[ \frac{\partial^2 L}{\partial\theta_i \, \partial\theta_j} \right]_{i,j=1}^{k}$$

is negative definite for every $\theta \in \Theta$ at which the gradient $\nabla L \equiv \left[ \frac{\partial L}{\partial\theta_i} \right]_{i=1}^{k}$ vanishes, and if the likelihood function approaches a constant on the boundary of the parameter space, $\partial\Theta$, i.e.,

$$\lim_{\theta \to \partial\Theta} L(\theta) = 0,$$

which may include the points at infinity if $\Theta$ is unbounded. Mäkeläinen and co-authors prove this result using Morse theory while informally appealing to a mountain pass property.[8] Mascarenhas restates their proof using the mountain pass theorem.[9]

In the proofs of consistency and asymptotic normality of the maximum likelihood estimator, additional assumptions are made about the probability densities that form the basis of a particular likelihood function. These conditions were first established by Chanda.[10] In particular, for almost all $x$, and for all $\theta \in \Theta$,

$$\frac{\partial \log f}{\partial\theta_r}, \quad \frac{\partial^2 \log f}{\partial\theta_r \, \partial\theta_s}, \quad \frac{\partial^3 \log f}{\partial\theta_r \, \partial\theta_s \, \partial\theta_t}$$

exist for all $r, s, t = 1, 2, \ldots, k$ in order to ensure the existence of a Taylor expansion. Second, for almost all $x$ and for every $\theta \in \Theta$ it must be that

$$\left| \frac{\partial f}{\partial\theta_r} \right| < F_r(x), \quad \left| \frac{\partial^2 f}{\partial\theta_r \, \partial\theta_s} \right| < F_{rs}(x), \quad \left| \frac{\partial^3 f}{\partial\theta_r \, \partial\theta_s \, \partial\theta_t} \right| < H_{rst}(x),$$

where $H$ is such that $\int_{-\infty}^{\infty} H_{rst}(z)\,\mathrm{d}z \leq M < \infty$. This boundedness of the derivatives is needed to allow for differentiation under the integral sign. And lastly, it is assumed that the information matrix,

$$\mathbf{I}(\theta) = \int_{-\infty}^{\infty} \frac{\partial \log f}{\partial\theta_r} \, \frac{\partial \log f}{\partial\theta_s} \, f \, \mathrm{d}z,$$

is positive definite and $|\mathbf{I}(\theta)|$ is finite. This ensures that the score has a finite variance.[11]

The above conditions are sufficient, but not necessary. That is, a model that does not meet these regularity conditions may or may not have a maximum likelihood estimator with the properties mentioned above. Further, in the case of non-independently or non-identically distributed observations additional properties may need to be assumed.

In Bayesian statistics, almost identical regularity conditions are imposed on the likelihood function in order to prove asymptotic normality of the posterior probability,[12][13] and therefore to justify a Laplace approximation of the posterior in large samples.[14]

Likelihood ratio and relative likelihood

See also: Pseudo-R-squared

Likelihood ratio

This section is about the likelihood ratio in general. For the use of likelihood ratios in interpreting diagnostic tests, see Likelihood ratios in diagnostic testing. For the statistical test to compare goodness of fit, see Likelihood-ratio test.

A likelihood ratio is the ratio of any two specified likelihoods, frequently written as

$$\Lambda(\theta_1 : \theta_2 \mid x) = \frac{\mathcal{L}(\theta_1 \mid x)}{\mathcal{L}(\theta_2 \mid x)}.$$

The likelihood ratio is central to likelihoodist statistics: the law of likelihood states that the degree to which data (considered as evidence) supports one parameter value versus another is measured by the likelihood ratio.

In frequentist inference, the likelihood ratio is the basis for a test statistic, the so-called likelihood-ratio test. By the Neyman–Pearson lemma, this is the most powerful test for comparing two simple hypotheses at a given significance level. Numerous other tests can be viewed as likelihood-ratio tests or approximations thereof.[15] The asymptotic distribution of the log-likelihood ratio, considered as a test statistic, is given by Wilks' theorem.

The likelihood ratio is also of central importance in Bayesian inference, where it is known as the Bayes factor, and is used in Bayes' rule. Stated in terms of odds, Bayes' rule states that the posterior odds of two alternatives, $A_1$ and $A_2$, given an event $B$, is the prior odds, times the likelihood ratio. As an equation:

$$O(A_1 : A_2 \mid B) = O(A_1 : A_2) \cdot \Lambda(A_1 : A_2 \mid B).$$
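Continuing the coin example, the odds form can be sketched numerically; the prior odds of 1 below is an arbitrary assumption made for illustration.

```python
# Posterior odds = prior odds * likelihood ratio, for the two simple hypotheses
# p_H = 0.5 versus p_H = 0.3 after observing HH.
prior_odds = 1.0                      # assume both hypotheses equally plausible a priori
likelihood_ratio = 0.5**2 / 0.3**2    # L(p_H = 0.5 | HH) / L(p_H = 0.3 | HH)
posterior_odds = prior_odds * likelihood_ratio
print(posterior_odds)                 # about 2.78 in favour of p_H = 0.5
```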

The likelihood ratio is not directly used in AIC-based statistics. Instead, what is used is the relative likelihood of models (see below).

In evidence-based medicine, likelihood ratios are used in diagnostic testing to assess the value of performing a diagnostic test.

Relative likelihood function

See also: Relative likelihood

Since the actual value of the likelihood function depends on the sample, it is often convenient to work with a standardized measure. Suppose that the maximum likelihood estimate for the parameter $\theta$ is $\hat{\theta}$. Relative plausibilities of other $\theta$ values may be found by comparing the likelihoods of those other values with the likelihood of $\hat{\theta}$. The relative likelihood of $\theta$ is defined to be[16][17][18][19][20]

$$R(\theta) = \frac{\mathcal{L}(\theta \mid x)}{\mathcal{L}(\hat{\theta} \mid x)}.$$

Thus, the relative likelihood is the likelihood ratio (discussed above) with the fixed denominator $\mathcal{L}(\hat{\theta})$. This corresponds to standardizing the likelihood to have a maximum of 1.
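As a sketch, for the HHT coin example above the maximum likelihood estimate is $p_{\text{H}} = 2/3$, and the relative likelihood of any other value follows directly; the helper functions below are illustrative only.

```python
def likelihood(p):                   # L(p_H | HHT) = p^2 (1 - p)
    return p**2 * (1 - p)

p_hat = 2 / 3                        # maximizer of the likelihood above

def relative_likelihood(p):
    return likelihood(p) / likelihood(p_hat)

print(relative_likelihood(p_hat))            # 1.0 by construction
print(round(relative_likelihood(0.5), 3))    # about 0.844
```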

Likelihood region


A likelihood region is the set of all values of $\theta$ whose relative likelihood is greater than or equal to a given threshold. In terms of percentages, a $p\%$ likelihood region for $\theta$ is defined to be[16][18][21]

$$\left\{ \theta : R(\theta) \geq \frac{p}{100} \right\}.$$

If $\theta$ is a single real parameter, a $p\%$ likelihood region will usually comprise an interval of real values. If the region does comprise an interval, then it is called a likelihood interval.[16][18][22]

Likelihood intervals, and more generally likelihood regions, are used for interval estimation within likelihoodist statistics: they are similar to confidence intervals in frequentist statistics and credible intervals in Bayesian statistics. Likelihood intervals are interpreted directly in terms of relative likelihood, not in terms of coverage probability (frequentism) or posterior probability (Bayesianism).

Given a model, likelihood intervals can be compared to confidence intervals. If $\theta$ is a single real parameter, then under certain conditions, a 14.65% likelihood interval (about 1:7 likelihood) for $\theta$ will be the same as a 95% confidence interval (19/20 coverage probability).[16][21] In a slightly different formulation suited to the use of log-likelihoods (see Wilks' theorem), the test statistic is twice the difference in log-likelihoods and the probability distribution of the test statistic is approximately a chi-squared distribution with degrees of freedom (df) equal to the difference in df's between the two models (therefore, the $e^{-2}$ likelihood interval is the same as the 0.954 confidence interval, assuming the difference in df's to be 1).[21][22]
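A likelihood interval can be located numerically by scanning the relative likelihood over a grid; the sketch below reuses the HHT example and the 14.65% threshold mentioned above.

```python
def relative_likelihood(p, p_hat=2/3):
    L = lambda q: q**2 * (1 - q)          # L(p_H | HHT)
    return L(p) / L(p_hat)

threshold = 0.1465                        # the ~14.65% likelihood region
grid = [i / 1000 for i in range(1, 1000)]
region = [p for p in grid if relative_likelihood(p) >= threshold]
print(min(region), max(region))           # approximate endpoints of the likelihood interval
```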

Likelihoods that eliminate nuisance parameters


In many cases, the likelihood is a function of more than one parameter but interest focuses on the estimation of only one, or at most a few of them, with the others being considered as nuisance parameters. Several alternative approaches have been developed to eliminate such nuisance parameters, so that a likelihood can be written as a function of only the parameter (or parameters) of interest: the main approaches are profile, conditional, and marginal likelihoods.[23][24] These approaches are also useful when a high-dimensional likelihood surface needs to be reduced to one or two parameters of interest in order to allow a graph.

Profile likelihood


It is possible to reduce the dimensions by concentrating the likelihood function for a subset of parameters by expressing the nuisance parameters as functions of the parameters of interest and replacing them in the likelihood function.[25][26] In general, for a likelihood function depending on the parameter vector ${\boldsymbol{\theta}}$ that can be partitioned into ${\boldsymbol{\theta}} = ({\boldsymbol{\theta}}_1 : {\boldsymbol{\theta}}_2)$, and where a correspondence $\hat{\boldsymbol{\theta}}_2 = \hat{\boldsymbol{\theta}}_2({\boldsymbol{\theta}}_1)$ can be determined explicitly, concentration reduces the computational burden of the original maximization problem.[27]

For instance, in a linear regression with normally distributed errors, $\mathbf{y} = \mathbf{X}\beta + u$, the coefficient vector could be partitioned into $\beta = [\beta_1 : \beta_2]$ (and consequently the design matrix $\mathbf{X} = [\mathbf{X}_1 : \mathbf{X}_2]$). Maximizing with respect to $\beta_2$ yields an optimal value function $\beta_2(\beta_1) = (\mathbf{X}_2^{\mathsf{T}} \mathbf{X}_2)^{-1} \mathbf{X}_2^{\mathsf{T}} (\mathbf{y} - \mathbf{X}_1 \beta_1)$. Using this result, the maximum likelihood estimator for $\beta_1$ can then be derived as

$$\hat{\beta}_1 = \left( \mathbf{X}_1^{\mathsf{T}} (\mathbf{I} - \mathbf{P}_2) \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^{\mathsf{T}} (\mathbf{I} - \mathbf{P}_2) \mathbf{y},$$

where $\mathbf{P}_2 = \mathbf{X}_2 (\mathbf{X}_2^{\mathsf{T}} \mathbf{X}_2)^{-1} \mathbf{X}_2^{\mathsf{T}}$ is the projection matrix of $\mathbf{X}_2$. This result is known as the Frisch–Waugh–Lovell theorem.

Since graphically the procedure of concentration is equivalent to slicing the likelihood surface along the ridge of values of the nuisance parameter $\beta_2$ that maximizes the likelihood function, creating an isometric profile of the likelihood function for a given $\beta_1$, the result of this procedure is also known as the profile likelihood.[28][29] In addition to being graphed, the profile likelihood can also be used to compute confidence intervals that often have better small-sample properties than those based on asymptotic standard errors calculated from the full likelihood.[30][31]
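The equivalence between the concentrated estimator and the full fit can be checked numerically; the simulated data and seed below are arbitrary choices made for illustration.

```python
# Verify numerically that the concentrated (profile) estimator of beta_1 matches
# the coefficients from the full least-squares fit (Frisch–Waugh–Lovell).
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=(n, 2))                 # block of interest
X2 = rng.normal(size=(n, 3))                 # nuisance block
beta_true = rng.normal(size=5)
y = np.hstack([X1, X2]) @ beta_true + rng.normal(scale=0.1, size=n)

# Full fit, then keep the first two coefficients.
beta_full, *_ = np.linalg.lstsq(np.hstack([X1, X2]), y, rcond=None)

# Concentrated fit: project out X2, then regress y on the residualized X1.
P2 = X2 @ np.linalg.solve(X2.T @ X2, X2.T)
M2 = np.eye(n) - P2
beta1_concentrated = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)

print(np.allclose(beta_full[:2], beta1_concentrated))   # True
```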

Conditional likelihood


Sometimes it is possible to find a sufficient statistic for the nuisance parameters, and conditioning on this statistic results in a likelihood which does not depend on the nuisance parameters.[32]

One example occurs in 2×2 tables, where conditioning on all four marginal totals leads to a conditional likelihood based on the non-central hypergeometric distribution. This form of conditioning is also the basis for Fisher's exact test.

Marginal likelihood

Main article: Marginal likelihood

Sometimes we can remove the nuisance parameters by considering a likelihood based on only part of the information in the data, for example by using the set of ranks rather than the numerical values. Another example occurs in linear mixed models, where considering a likelihood for the residuals only after fitting the fixed effects leads to residual maximum likelihood estimation of the variance components.

Partial likelihood


A partial likelihood is an adaptation of the full likelihood such that only a part of the parameters (the parameters of interest) occur in it.[33] It is a key component of the proportional hazards model: using a restriction on the hazard function, the likelihood does not contain the shape of the hazard over time.

Products of likelihoods


The likelihood, given two or more independent events, is the product of the likelihoods of each of the individual events:

$$\Lambda(A \mid X_1 \land X_2) = \Lambda(A \mid X_1) \cdot \Lambda(A \mid X_2).$$

This follows from the definition of independence in probability: the probabilities of two independent events happening, given a model, is the product of the probabilities.

This is particularly important when the events are from independent and identically distributed random variables, such as independent observations or sampling with replacement. In such a situation, the likelihood function factors into a product of individual likelihood functions.

The empty product has value 1, which corresponds to the likelihood, given no event, being 1: before any data, the likelihood is always 1. This is similar to a uniform prior in Bayesian statistics, but in likelihoodist statistics this is not an improper prior because likelihoods are not integrated.
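For example, with independent coin flips the likelihood of the observed sequence is the product of the per-flip likelihoods; the toy sketch below assumes the Bernoulli coin model used earlier.

```python
import math

def flip_likelihood(p, outcome):           # outcome is "H" or "T"
    return p if outcome == "H" else 1 - p

p = 0.3
flips = ["H", "H", "T"]
joint = math.prod(flip_likelihood(p, f) for f in flips)
print(joint)                               # 0.3 * 0.3 * 0.7, about 0.063
```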

Log-likelihood

See also: Log-probability

The log-likelihood function is the logarithm of the likelihood function, often denoted by a lowercase $l$ or $\ell$, to contrast with the uppercase $L$ or $\mathcal{L}$ for the likelihood. Because logarithms are strictly increasing functions, maximizing the likelihood is equivalent to maximizing the log-likelihood. But for practical purposes it is more convenient to work with the log-likelihood function in maximum likelihood estimation, in particular since most common probability distributions, notably the exponential family, are only logarithmically concave,[34][35] and concavity of the objective function plays a key role in the maximization.

Given the independence of each event, the overall log-likelihood of intersection equals the sum of the log-likelihoods of the individual events. This is analogous to the fact that the overall log-probability is the sum of the log-probabilities of the individual events. In addition to the mathematical convenience of this, the adding process of log-likelihood has an intuitive interpretation, often expressed as "support" from the data. When the parameters are estimated using the log-likelihood for maximum likelihood estimation, each data point is used by being added to the total log-likelihood. As the data can be viewed as evidence that supports the estimated parameters, this process can be interpreted as "support from independent evidence adds", and the log-likelihood is the "weight of evidence". Interpreting negative log-probability as information content or surprisal, the support (log-likelihood) of a model, given an event, is the negative of the surprisal of the event, given the model: a model is supported by an event to the extent that the event is unsurprising, given the model.

A logarithm of a likelihood ratio is equal to the difference of the log-likelihoods:

$$\log \frac{\mathcal{L}(A)}{\mathcal{L}(B)} = \log \mathcal{L}(A) - \log \mathcal{L}(B) = \ell(A) - \ell(B).$$

Just as the likelihood, given no event, is 1, the log-likelihood, given no event, is 0, which corresponds to the value of the empty sum: without any data, there is no support for any models.
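The additivity can be seen directly in a small sketch: the log of the product of per-observation likelihoods equals the sum of their logs (up to rounding).

```python
import math

def log_likelihood(p, flips):
    return sum(math.log(p if f == "H" else 1 - p) for f in flips)

flips = ["H", "H", "T"]
print(log_likelihood(0.3, flips))          # sum of per-flip log-likelihoods
print(math.log(0.3**2 * 0.7))              # log of the product: the same value
```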

Graph


The graph of the log-likelihood is called the support curve (in the univariate case).[36] In the multivariate case, the concept generalizes into a support surface over the parameter space. It has a relation to, but is distinct from, the support of a distribution.

The term was coined by A. W. F. Edwards[36] in the context of statistical hypothesis testing, i.e. whether or not the data "support" one hypothesis (or parameter value) being tested more than any other.

The log-likelihood function being plotted is used in the computation of the score (the gradient of the log-likelihood) and Fisher information (the curvature of the log-likelihood). Thus, the graph has a direct interpretation in the context of maximum likelihood estimation and likelihood-ratio tests.

Likelihood equations


If the log-likelihood function is smooth, its gradient with respect to the parameter, known as the score and written $s_n(\theta) \equiv \nabla_{\theta} \ell_n(\theta)$, exists and allows for the application of differential calculus. The basic way to maximize a differentiable function is to find the stationary points (the points where the derivative is zero); since the derivative of a sum is just the sum of the derivatives, but the derivative of a product requires the product rule, it is easier to compute the stationary points of the log-likelihood of independent events than for the likelihood of independent events.

The equations defined by the stationary point of the score function serve as estimating equations for the maximum likelihood estimator:

$$s_n(\theta) = \mathbf{0}.$$

In that sense, the maximum likelihood estimator is implicitly defined by the value at $\mathbf{0}$ of the inverse function $s_n^{-1}: \mathbb{E}^d \to \Theta$, where $\mathbb{E}^d$ is the $d$-dimensional Euclidean space, and $\Theta$ is the parameter space. Using the inverse function theorem, it can be shown that $s_n^{-1}$ is well-defined in an open neighborhood about $\mathbf{0}$ with probability going to one, and $\hat{\theta}_n = s_n^{-1}(\mathbf{0})$ is a consistent estimate of $\theta$. As a consequence there exists a sequence $\{\hat{\theta}_n\}$ such that $s_n(\hat{\theta}_n) = \mathbf{0}$ asymptotically almost surely, and $\hat{\theta}_n \xrightarrow{\text{p}} \theta_0$.[37] A similar result can be established using Rolle's theorem.[38][39]
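As a sketch, the score equation for a Bernoulli sample can be solved with Newton's method; the data (7 heads in 10 flips) and the starting value are made up for illustration.

```python
def score(p, heads=7, n=10):
    """Derivative of the Bernoulli log-likelihood: heads/p - (n - heads)/(1 - p)."""
    return heads / p - (n - heads) / (1 - p)

def score_derivative(p, heads=7, n=10):
    return -heads / p**2 - (n - heads) / (1 - p)**2

p = 0.5                                   # starting value
for _ in range(20):
    p -= score(p) / score_derivative(p)   # Newton step toward s_n(p) = 0
print(p)                                  # converges to the MLE, 0.7
```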

The second derivative evaluated at $\hat{\theta}$, known as Fisher information, determines the curvature of the likelihood surface,[40] and thus indicates the precision of the estimate.[41]

Exponential families

Further information: Exponential family

The log-likelihood is also particularly useful for exponential families of distributions, which include many of the common parametric probability distributions. The probability distribution function (and thus likelihood function) for exponential families contain products of factors involving exponentiation. The logarithm of such a function is a sum of products, again easier to differentiate than the original function.

An exponential family is one whose probability density function is of the form (for some functions, writing $\langle -, - \rangle$ for the inner product):

$$p(x \mid {\boldsymbol{\theta}}) = h(x) \exp\!\big( \langle {\boldsymbol{\eta}}({\boldsymbol{\theta}}), \mathbf{T}(x) \rangle - A({\boldsymbol{\theta}}) \big).$$

Each of these terms has an interpretation,[a] but simply switching from probability to likelihood and taking logarithms yields the sum:

$$\ell({\boldsymbol{\theta}} \mid x) = \langle {\boldsymbol{\eta}}({\boldsymbol{\theta}}), \mathbf{T}(x) \rangle - A({\boldsymbol{\theta}}) + \log h(x).$$

The ${\boldsymbol{\eta}}({\boldsymbol{\theta}})$ and $h(x)$ each correspond to a change of coordinates, so in these coordinates, the log-likelihood of an exponential family is given by the simple formula:

$$\ell({\boldsymbol{\eta}} \mid x) = \langle {\boldsymbol{\eta}}, \mathbf{T}(x) \rangle - A({\boldsymbol{\eta}}).$$

In words, the log-likelihood of an exponential family is the inner product of the natural parameter ${\boldsymbol{\eta}}$ and the sufficient statistic $\mathbf{T}(x)$, minus the normalization factor (log-partition function) $A({\boldsymbol{\eta}})$. Thus, for example, the maximum likelihood estimate can be computed by taking derivatives of the sufficient statistic $T$ and the log-partition function $A$.
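As an illustration, the Bernoulli model fits this form with natural parameter $\eta = \log\big(\theta/(1-\theta)\big)$, sufficient statistic $T(x) = x$, and log-partition function $A(\eta) = \log(1 + e^{\eta})$; the small sketch below checks that the formula reproduces the usual log-likelihood.

```python
import math

def bernoulli_loglik_natural(eta, x):
    # <eta, T(x)> - A(eta), with T(x) = x and A(eta) = log(1 + e^eta)
    return eta * x - math.log(1 + math.exp(eta))

theta = 0.3
eta = math.log(theta / (1 - theta))
print(bernoulli_loglik_natural(eta, 1), math.log(theta))       # both equal log(0.3) up to rounding
print(bernoulli_loglik_natural(eta, 0), math.log(1 - theta))   # both equal log(0.7) up to rounding
```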

Example: the gamma distribution


The gamma distribution is an exponential family with two parameters, $\alpha$ and $\beta$. The likelihood function is

$$\mathcal{L}(\alpha, \beta \mid x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}.$$

Finding the maximum likelihood estimate of $\beta$ for a single observed value $x$ looks rather daunting. Its logarithm is much simpler to work with:

$$\log \mathcal{L}(\alpha, \beta \mid x) = \alpha \log \beta - \log \Gamma(\alpha) + (\alpha - 1) \log x - \beta x.$$

To maximize the log-likelihood, we first take the partial derivative with respect to $\beta$:

$$\frac{\partial \log \mathcal{L}(\alpha, \beta \mid x)}{\partial \beta} = \frac{\alpha}{\beta} - x.$$

If there are a number of independent observations $x_1, \ldots, x_n$, then the joint log-likelihood will be the sum of individual log-likelihoods, and the derivative of this sum will be the sum of the derivatives of each individual log-likelihood:

$$\begin{aligned} \frac{\partial \log \mathcal{L}(\alpha, \beta \mid x_1, \ldots, x_n)}{\partial \beta} &= \frac{\partial \log \mathcal{L}(\alpha, \beta \mid x_1)}{\partial \beta} + \cdots + \frac{\partial \log \mathcal{L}(\alpha, \beta \mid x_n)}{\partial \beta} \\ &= \frac{n\alpha}{\beta} - \sum_{i=1}^{n} x_i. \end{aligned}$$

To complete the maximization procedure for the joint log-likelihood, the equation is set to zero and solved for $\beta$:

$$\widehat{\beta} = \frac{\alpha}{\bar{x}}.$$

Here $\widehat{\beta}$ denotes the maximum-likelihood estimate, and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the sample mean of the observations.
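A quick numerical check of this result, with made-up observations and $\alpha$ held fixed:

```python
import math

def gamma_log_likelihood(alpha, beta, xs):
    return sum(alpha * math.log(beta) - math.lgamma(alpha)
               + (alpha - 1) * math.log(x) - beta * x for x in xs)

alpha = 2.0
xs = [0.8, 1.3, 2.1, 0.5, 1.9]            # illustrative data
beta_hat = alpha / (sum(xs) / len(xs))    # the closed-form estimate alpha / x-bar

for beta in (0.9 * beta_hat, beta_hat, 1.1 * beta_hat):
    print(beta, gamma_log_likelihood(alpha, beta, xs))   # the middle value is largest
```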

Background and interpretation


Historical remarks

See also: History of statistics and History of probability

The term "likelihood" has been in use in English since at least late Middle English.[42] Its formal use to refer to a specific function in mathematical statistics was proposed by Ronald Fisher,[43] in two research papers published in 1921[44] and 1922.[45] The 1921 paper introduced what is today called a "likelihood interval"; the 1922 paper introduced the term "method of maximum likelihood". Quoting Fisher:

[I]n 1922, I proposed the term 'likelihood,' in view of the fact that, with respect to [the parameter], it is not a probability, and does not obey the laws of probability, while at the same time it bears to the problem of rational choice among the possible values of [the parameter] a relation similar to that which probability bears to the problem of predicting events in games of chance. . . . Whereas, however, in relation to psychological judgment, likelihood has some resemblance to probability, the two concepts are wholly distinct. . . ."[46]

The concept of likelihood should not be confused with probability, as Sir Ronald Fisher emphasized:

I stress this because in spite of the emphasis that I have always laid upon the difference between probability and likelihood there is still a tendency to treat likelihood as though it were a sort of probability. The first result is thus that there are two different measures of rational belief appropriate to different cases. Knowing the population we can express our incomplete knowledge of, or expectation of, the sample in terms of probability; knowing the sample we can express our incomplete knowledge of the population in terms of likelihood.[47]

Fisher's invention of statistical likelihood was in reaction against an earlier form of reasoning called inverse probability.[48] His use of the term "likelihood" fixed the meaning of the term within mathematical statistics.

A. W. F. Edwards (1972) established the axiomatic basis for use of the log-likelihood ratio as a measure of relative support for one hypothesis against another. The support function is then the natural logarithm of the likelihood function. Both terms are used in phylogenetics, but were not adopted in a general treatment of the topic of statistical evidence.[49]

Interpretations under different foundations


Among statisticians, there is no consensus about what the foundation of statistics should be. There are four main paradigms that have been proposed for the foundation: frequentism, Bayesianism, likelihoodism, and AIC-based.[50] For each of the proposed foundations, the interpretation of likelihood is different. The four interpretations are described in the subsections below.

Frequentist interpretation


Bayesian interpretation


In Bayesian inference, although one can speak about the likelihood of any proposition or random variable given another random variable: for example the likelihood of a parameter value or of a statistical model (see marginal likelihood), given specified data or other evidence,[51][52][53][54] the likelihood function remains the same entity, with the additional interpretations of (i) a conditional density of the data given the parameter (since the parameter is then a random variable) and (ii) a measure or amount of information brought by the data about the parameter value or even the model.[51][52][53][54][55] Due to the introduction of a probability structure on the parameter space or on the collection of models, it is possible that a parameter value or a statistical model have a large likelihood value for given data, and yet have a low probability, or vice versa.[53][55] This is often the case in medical contexts.[56] Following Bayes' rule, the likelihood when seen as a conditional density can be multiplied by the prior probability density of the parameter and then normalized, to give a posterior probability density.[51][52][53][54][55] More generally, the likelihood of an unknown quantity $X$ given another unknown quantity $Y$ is proportional to the probability of $Y$ given $X$.[51][52][53][54][55]

Likelihoodist interpretation


In frequentist statistics, the likelihood function is itself a statistic that summarizes a single sample from a population, whose calculated value depends on a choice of several parameters $\theta_1 \ldots \theta_p$, where $p$ is the count of parameters in some already-selected statistical model. The value of the likelihood serves as a figure of merit for the choice used for the parameters, and the parameter set with maximum likelihood is the best choice, given the data available.

The specific calculation of the likelihood is the probability that the observed sample would be assigned, assuming that the model chosen and the values of the several parameters $\theta$ give an accurate approximation of the frequency distribution of the population that the observed sample was drawn from. Heuristically, it makes sense that a good choice of parameters is one which renders the sample actually observed the maximum possible post-hoc probability of having happened. Wilks' theorem quantifies the heuristic rule by showing that the difference in the logarithm of the likelihood generated by the estimate's parameter values and the logarithm of the likelihood generated by the population's "true" (but unknown) parameter values is asymptotically $\chi^2$ distributed.

Each independent sample's maximum likelihood estimate is a separate estimate of the "true" parameter set describing the population sampled. Successive estimates from many independent samples will cluster together with the population's "true" set of parameter values hidden somewhere in their midst. The difference in the logarithms of the maximum likelihood and adjacent parameter sets' likelihoods may be used to draw a confidence region on a plot whose co-ordinates are the parameters $\theta_1 \ldots \theta_p$. The region surrounds the maximum-likelihood estimate, and all points (parameter sets) within that region differ at most in log-likelihood by some fixed value. The $\chi^2$ distribution given by Wilks' theorem converts the region's log-likelihood differences into the "confidence" that the population's "true" parameter set lies inside. The art of choosing the fixed log-likelihood difference is to make the confidence acceptably high while keeping the region acceptably small (narrow range of estimates).

As more data are observed, instead of being used to make independent estimates, they can be combined with the previous samples to make a single combined sample, and that large sample may be used for a new maximum likelihood estimate. As the size of the combined sample increases, the size of the likelihood region with the same confidence shrinks. Eventually, either the size of the confidence region is very nearly a single point, or the entire population has been sampled; in both cases, the estimated parameter set is essentially the same as the population parameter set.

AIC-based interpretation


Under the AIC paradigm, likelihood is interpreted within the context of information theory.[57][58][59]


Notes

  1. ^ See Exponential family § Interpretation

References

  1. ^Casella, George; Berger, Roger L. (2002).Statistical Inference (2nd ed.). Duxbury. p. 290.ISBN 0-534-24312-6.
  2. ^Wakefield, Jon (2013).Frequentist and Bayesian Regression Methods (1st ed.). Springer. p. 36.ISBN 978-1-4419-0925-1.
  3. ^Lehmann, Erich L.; Casella, George (1998).Theory of Point Estimation (2nd ed.). Springer. p. 444.ISBN 0-387-98502-6.
  4. ^Zellner, Arnold (1971).An Introduction to Bayesian Inference in Econometrics. New York: Wiley. pp. 13–14.ISBN 0-471-98165-6.
  5. ^Billingsley, Patrick (1995).Probability and Measure (Third ed.).John Wiley & Sons. pp. 422–423.
  6. ^Shao, Jun (2003).Mathematical Statistics (2nd ed.). Springer. §4.4.1.
  7. ^Gouriéroux, Christian; Monfort, Alain (1995).Statistics and Econometric Models. New York: Cambridge University Press. p. 161.ISBN 0-521-40551-3.
  8. ^Mäkeläinen, Timo; Schmidt, Klaus; Styan, George P.H. (1981)."On the existence and uniqueness of the maximum likelihood estimate of a vector-valued parameter in fixed-size samples".Annals of Statistics.9 (4):758–767.doi:10.1214/aos/1176345516.JSTOR 2240844.
  9. ^Mascarenhas, W.F. (2011). "A mountain pass lemma and its implications regarding the uniqueness of constrained minimizers".Optimization.60 (8–9):1121–1159.doi:10.1080/02331934.2010.527973.S2CID 15896597.
  10. ^Chanda, K.C. (1954). "A note on the consistency and maxima of the roots of likelihood equations".Biometrika.41 (1–2):56–61.doi:10.2307/2333005.JSTOR 2333005.
  11. ^Greenberg, Edward; Webster, Charles E. Jr. (1983).Advanced Econometrics: A Bridge to the Literature. New York, NY: John Wiley & Sons. pp. 24–25.ISBN 0-471-09077-8.
  12. ^Heyde, C. C.; Johnstone, I. M. (1979). "On Asymptotic Posterior Normality for Stochastic Processes".Journal of the Royal Statistical Society. Series B (Methodological).41 (2):184–189.doi:10.1111/j.2517-6161.1979.tb01071.x.
  13. ^Chen, Chan-Fu (1985). "On Asymptotic Normality of Limiting Density Functions with Bayesian Implications".Journal of the Royal Statistical Society. Series B (Methodological).47 (3):540–546.doi:10.1111/j.2517-6161.1985.tb01384.x.
  14. ^Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1990). "The Validity of Posterior Expansions Based on Laplace's Method". In Geisser, S.; Hodges, J. S.; Press, S. J.; Zellner, A. (eds.).Bayesian and Likelihood Methods in Statistics and Econometrics. Elsevier. pp. 473–488.ISBN 0-444-88376-2.
  15. ^Buse, A. (1982). "The Likelihood Ratio, Wald, and Lagrange Multiplier Tests: An Expository Note".The American Statistician.36 (3a):153–157.doi:10.1080/00031305.1982.10482817.
  16. ^abcdKalbfleisch, J. G. (1985),Probability and Statistical Inference, Springer (§9.3).
  17. ^Azzalini, A. (1996),Statistical Inference—Based on the likelihood,Chapman & Hall,ISBN 9780412606502 (§1.4.2).
  18. ^abcSprott, D. A. (2000),Statistical Inference in Science, Springer (chap. 2).
  19. ^Davison, A. C. (2008),Statistical Models,Cambridge University Press (§4.1.2).
  20. ^Held, L.; Sabanés Bové, D. S. (2014),Applied Statistical Inference—Likelihood and Bayes, Springer (§2.1).
  21. ^abcRossi, R. J. (2018),Mathematical Statistics,Wiley, p. 267.
  22. ^abHudson, D. J. (1971), "Interval estimation from the likelihood function",Journal of the Royal Statistical Society, Series B,33 (2):256–262,doi:10.1111/j.2517-6161.1971.tb00877.x.
  23. ^Pawitan, Yudi (2001).In All Likelihood: Statistical Modelling and Inference Using Likelihood.Oxford University Press.
  24. ^Wen Hsiang Wei."Generalized Linear Model - course notes". Taichung, Taiwan:Tunghai University. pp. Chapter 5. Retrieved2017-10-01.
  25. ^Amemiya, Takeshi (1985)."Concentrated Likelihood Function".Advanced Econometrics. Cambridge: Harvard University Press. pp. 125–127.ISBN 978-0-674-00560-0.
  26. ^Davidson, Russell;MacKinnon, James G. (1993). "Concentrating the Loglikelihood Function".Estimation and Inference in Econometrics. New York: Oxford University Press. pp. 267–269.ISBN 978-0-19-506011-9.
  27. ^Gourieroux, Christian; Monfort, Alain (1995)."Concentrated Likelihood Function".Statistics and Econometric Models. New York: Cambridge University Press. pp. 170–175.ISBN 978-0-521-40551-5.
  28. ^Pickles, Andrew (1985).An Introduction to Likelihood Analysis. Norwich: W. H. Hutchins & Sons. pp. 21–24.ISBN 0-86094-190-6.
  29. ^Bolker, Benjamin M. (2008).Ecological Models and Data in R. Princeton University Press. pp. 187–189.ISBN 978-0-691-12522-0.
  30. ^Aitkin, Murray (1982). "Direct Likelihood Inference".GLIM 82: Proceedings of the International Conference on Generalised Linear Models. Springer. pp. 76–86.ISBN 0-387-90777-7.
  31. ^Venzon, D. J.; Moolgavkar, S. H. (1988). "A Method for Computing Profile-Likelihood-Based Confidence Intervals".Journal of the Royal Statistical Society. Series C (Applied Statistics).37 (1):87–94.doi:10.2307/2347496.JSTOR 2347496.
  32. ^Kalbfleisch, J. D.; Sprott, D. A. (1973). "Marginal and Conditional Likelihoods".Sankhyā: The Indian Journal of Statistics. Series A.35 (3):311–328.JSTOR 25049882.
  33. ^Cox, D. R. (1975). "Partial likelihood".Biometrika.62 (2):269–276.doi:10.1093/biomet/62.2.269.MR 0400509.
  34. ^Kass, Robert E.; Vos, Paul W. (1997).Geometrical Foundations of Asymptotic Inference. New York: John Wiley & Sons. p. 14.ISBN 0-471-82668-5.
  35. ^Papadopoulos, Alecos (September 25, 2013)."Why we always put log() before the joint pdf when we use MLE (Maximum likelihood Estimation)?".Stack Exchange.
  36. ^abEdwards, A. W. F. (1992) [1972].Likelihood.Johns Hopkins University Press.ISBN 0-8018-4443-6.
  37. ^Foutz, Robert V. (1977). "On the Unique Consistent Solution to the Likelihood Equations".Journal of the American Statistical Association.72 (357):147–148.doi:10.1080/01621459.1977.10479926.
  38. ^Tarone, Robert E.; Gruenhage, Gary (1975). "A Note on the Uniqueness of Roots of the Likelihood Equations for Vector-Valued Parameters".Journal of the American Statistical Association.70 (352):903–904.doi:10.1080/01621459.1975.10480321.
  39. ^Rai, Kamta; Van Ryzin, John (1982). "A Note on a Multivariate Version of Rolle's Theorem and Uniqueness of Maximum Likelihood Roots".Communications in Statistics. Theory and Methods.11 (13):1505–1510.doi:10.1080/03610928208828325.
  40. ^Rao, B. Raja (1960). "A formula for the curvature of the likelihood surface of a sample drawn from a distribution admitting sufficient statistics".Biometrika.47 (1–2):203–207.doi:10.1093/biomet/47.1-2.203.
  41. ^Ward, Michael D.; Ahlquist, John S. (2018).Maximum Likelihood for Social Science : Strategies for Analysis.Cambridge University Press. pp. 25–27.
  42. ^"likelihood",Shorter Oxford English Dictionary (2007).
  43. ^Hald, A. (1999)."On the history of maximum likelihood in relation to inverse probability and least squares".Statistical Science.14 (2):214–222.doi:10.1214/ss/1009212248.JSTOR 2676741.
  44. ^Fisher, R.A. (1921). "On the "probable error" of a coefficient of correlation deduced from a small sample".Metron.1:3–32.
  45. ^Fisher, R.A. (1922)."On the mathematical foundations of theoretical statistics".Philosophical Transactions of the Royal Society A.222 (594–604):309–368.Bibcode:1922RSPTA.222..309F.doi:10.1098/rsta.1922.0009.hdl:2440/15172.JFM 48.1280.02.JSTOR 91208.
  46. ^Klemens, Ben (2008).Modeling with Data: Tools and Techniques for Scientific Computing.Princeton University Press. p. 329.
  47. ^Fisher, Ronald (1930). "Inverse Probability".Mathematical Proceedings of the Cambridge Philosophical Society.26 (4):528–535.Bibcode:1930PCPS...26..528F.doi:10.1017/S0305004100016297.
  48. ^Fienberg, Stephen E (1997). "Introduction to R.A. Fisher on inverse probability and likelihood".Statistical Science.12 (3): 161.doi:10.1214/ss/1030037905.
  49. ^Royall, R. (1997).Statistical Evidence.Chapman & Hall.
  50. ^Bandyopadhyay, P. S.; Forster, M. R., eds. (2011).Philosophy of Statistics.North-Holland Publishing.
  51. ^abcdI. J. Good:Probability and the Weighing of Evidence (Griffin 1950), §6.1
  52. ^abcdH. Jeffreys:Theory of Probability (3rd ed., Oxford University Press 1983), §1.22
  53. ^abcdeE. T. Jaynes:Probability Theory: The Logic of Science (Cambridge University Press 2003), §4.1
  54. ^abcdD. V. Lindley:Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 1: Probability (Cambridge University Press 1980), §1.6
  55. ^abcdA. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin:Bayesian Data Analysis (3rd ed., Chapman & Hall/CRC 2014), §1.3
  56. ^Sox, H. C.; Higgins, M. C.; Owens, D. K. (2013),Medical Decision Making (2nd ed.), Wiley, chapters 3–4,doi:10.1002/9781118341544,ISBN 9781118341544
  57. ^Akaike, H. (1985). "Prediction and entropy". In Atkinson, A. C.;Fienberg, S. E. (eds.).A Celebration of Statistics. Springer. pp. 1–24.
  58. ^Sakamoto, Y.; Ishiguro, M.; Kitagawa, G. (1986).Akaike Information Criterion Statistics.D. Reidel. Part I.
  59. ^Burnham, K. P.; Anderson, D. R. (2002).Model Selection and Multimodel Inference: A practical information-theoretic approach (2nd ed.).Springer-Verlag. chap. 7.


External links

Look up likelihood in Wiktionary, the free dictionary.