The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data (such as a proposition that expresses testable information).
Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data. According to this principle, the distribution with maximal information entropy is the best choice.
The principle was first expounded by E. T. Jaynes in two papers in 1957,[1][2] where he emphasized a natural correspondence between statistical mechanics and information theory. In particular, Jaynes argued that the Gibbsian method of statistical mechanics is sound by also arguing that the entropy of statistical mechanics and the information entropy of information theory are the same concept. Consequently, statistical mechanics should be considered a particular application of a general tool of logical inference and information theory.
In most practical cases, the stated prior data or testable information is given by a set of conserved quantities (average values of some moment functions), associated with the probability distribution in question. This is the way the maximum entropy principle is most often used in statistical thermodynamics. Another possibility is to prescribe some symmetries of the probability distribution. The equivalence between conserved quantities and corresponding symmetry groups implies a similar equivalence for these two ways of specifying the testable information in the maximum entropy method.
The maximum entropy principle is also needed to guarantee the uniqueness and consistency of probability assignments obtained by different methods, statistical mechanics and logical inference in particular.
The maximum entropy principle makes explicit our freedom in using different forms of prior data. As a special case, a uniform prior probability density (Laplace's principle of indifference, sometimes called the principle of insufficient reason), may be adopted. Thus, the maximum entropy principle is not merely an alternative way to view the usual methods of inference of classical statistics, but represents a significant conceptual generalization of those methods.
However, these statements do not imply that thermodynamical systems need not be shown to be ergodic to justify treatment as a statistical ensemble.
In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.
The principle of maximum entropy is useful explicitly only when applied to testable information. Testable information is a statement about a probability distribution whose truth or falsity is well-defined. For example, the statements
the expectation of the variable $x$ is 2.87
and
$$p_2 + p_3 > 0.6$$
(where $p_2$ and $p_3$ are probabilities of events) are statements of testable information.
Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information. This constrained optimization problem is typically solved using the method of Lagrange multipliers.[3]
Entropy maximization with no testable information respects the universal "constraint" that the sum of the probabilities is one. Under this constraint, the maximum entropy discrete probability distribution is the uniform distribution,
$$p_i = \frac{1}{n} \quad \text{for all } i \in \{1, \dots, n\}.$$
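As a quick numerical illustration of this statement, the following minimal Python sketch compares the Shannon entropy of the uniform distribution with that of randomly drawn distributions on the same number of outcomes. The function names `shannon_entropy` and `random_distribution`, the choice of six outcomes, and the random seed are assumptions made for this example only; the check is a spot test, not a proof.

```python
import math
import random

def shannon_entropy(p):
    """Shannon entropy -sum p_i log p_i in nats; terms with p_i == 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def random_distribution(n, rng):
    """Draw a random point of the probability simplex by normalizing positive weights."""
    w = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(w)
    return [wi / total for wi in w]

n = 6                      # illustrative number of outcomes
rng = random.Random(0)     # fixed seed so the run is reproducible
uniform = [1.0 / n] * n
h_uniform = shannon_entropy(uniform)
h_random = [shannon_entropy(random_distribution(n, rng)) for _ in range(1000)]

print(h_uniform, math.log(n))              # the two values agree up to rounding
print(max(h_random) <= h_uniform + 1e-12)  # no sampled distribution beats the uniform one
```

The entropy of the uniform distribution equals log n, and every other normalized distribution on n outcomes has strictly lower entropy.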
The principle of maximum entropy is commonly applied in two ways to inferential problems:
The principle of maximum entropy is often used to obtain prior probability distributions for Bayesian inference. Jaynes was a strong advocate of this approach, claiming the maximum entropy distribution represented the least informative distribution.[4] A large amount of literature is now dedicated to the elicitation of maximum entropy priors and links with channel coding.[5][6][7][8]
Maximum entropy is a sufficient updating rule for radical probabilism. Richard Jeffrey's probability kinematics is a special case of maximum entropy inference. However, maximum entropy is not a generalisation of all such sufficient updating rules.[9]
Alternatively, the principle is often invoked for model specification: in this case the observed data itself is assumed to be the testable information. Such models are widely used in natural language processing. An example of such a model is logistic regression, which corresponds to the maximum entropy classifier for independent observations.
The maximum entropy principle has also been applied in economics and resource allocation. For example, the Boltzmann fair division model uses the maximum entropy (Boltzmann) distribution to allocate resources or income among individuals, providing a probabilistic approach to distributive justice.[10]
One of the main applications of the maximum entropy principle is in discrete and continuous density estimation.[11][12] Similar to support vector machine estimators, the maximum entropy principle may require the solution to a quadratic programming problem, and thus provide a sparse mixture model as the optimal density estimator. One important advantage of the method is its ability to incorporate prior information in the density estimation.[13]
We have some testable information $I$ about a quantity $x$ taking values in $\{x_1, x_2, \ldots, x_n\}$. We assume this information has the form of $m$ constraints on the expectations of the functions $f_k$; that is, we require our probability distribution to satisfy the moment inequality/equality constraints:
$$\sum_{i=1}^{n} \Pr(x_i)\, f_k(x_i) \geq F_k, \qquad k = 1, \ldots, m,$$
where the $F_k$ are observables. We also require the probability distribution to sum to one, which may be viewed as a primitive constraint on the identity function and an observable equal to 1 giving the constraint
$$\sum_{i=1}^{n} \Pr(x_i) = 1.$$
The probability distribution with maximum information entropy subject to these inequality/equality constraints is of the form:[11]
$$\Pr(x_i) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right],$$
for some $\lambda_1, \ldots, \lambda_m$. It is sometimes called the Gibbs distribution. The normalization constant is determined by:
$$Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{n} \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right],$$
and is conventionally called the partition function. (The Pitman–Koopman theorem states that the necessary and sufficient condition for a sampling distribution to admit sufficient statistics of bounded dimension is that it have the general form of a maximum entropy distribution.)
The $\lambda_k$ parameters are Lagrange multipliers. In the case of equality constraints their values are determined from the solution of the nonlinear equations
$$F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \ldots, \lambda_m).$$
In the case of inequality constraints, the Lagrange multipliers are determined from the solution of a convex optimization program with linear constraints.[11] In both cases, there is in general no closed-form solution, and the computation of the Lagrange multipliers usually requires numerical methods.
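To make the numerical step concrete, here is a minimal Python sketch for the simplest case: a single equality constraint on the mean, with $f(x) = x$. The six-sided die, the target mean of 4.5, and the helper names `gibbs_distribution`, `mean` and `solve_multiplier` are all assumptions chosen for illustration. Bisection is used only because the mean of the Gibbs distribution increases monotonically with the multiplier; it is one possible numerical method, not the one prescribed by the cited references.

```python
import math

def gibbs_distribution(values, lam):
    """Return p_i proportional to exp(lam * x_i), i.e. the Gibbs form with f(x) = x."""
    # Subtract the largest exponent before exponentiating for numerical stability.
    exponents = [lam * x for x in values]
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    z = sum(weights)  # partition function, up to the common factor exp(top)
    return [w / z for w in weights]

def mean(values, p):
    """Expectation of x under the distribution p."""
    return sum(x * pi for x, pi in zip(values, p))

def solve_multiplier(values, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on lam: the mean of the Gibbs distribution is increasing in lam."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(values, gibbs_distribution(values, mid)) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]   # hypothetical example: a six-sided die
target_mean = 4.5            # assumed testable information: E[x] = 4.5
lam = solve_multiplier(faces, target_mean)
p = gibbs_distribution(faces, lam)
print(lam, p, mean(faces, p))
```

Because the target mean exceeds the unconstrained mean of 3.5, the multiplier comes out positive and the probabilities increase with the face value. With several constraints the same idea applies, but a multidimensional root finder or convex optimizer would replace the bisection.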
For continuous distributions, the Shannon entropy cannot be used, as it is only defined for discrete probability spaces. Instead Edwin Jaynes (1963, 1968, 2003) gave the following formula, which is closely related to the relative entropy (see also differential entropy).
$$H_c = -\int p(x) \log \frac{p(x)}{q(x)} \, dx$$
where $q(x)$, which Jaynes called the "invariant measure", is proportional to the limiting density of discrete points. For now, we shall assume that $q$ is known; we will discuss it further after the solution equations are given.
A closely related quantity, the relative entropy, is usually defined as the Kullback–Leibler divergence of $p$ from $q$ (although it is sometimes, confusingly, defined as the negative of this). The inference principle of minimizing this, due to Kullback, is known as the Principle of Minimum Discrimination Information.
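The link between relative entropy and ordinary entropy can be checked in the discrete setting, where the Kullback–Leibler divergence of $p$ from $q$ is $\sum_i p_i \log(p_i / q_i)$. The sketch below, with an arbitrarily chosen illustrative distribution and hypothetical helper names, verifies that for a uniform reference $q$ on $n$ points the divergence equals $\log n$ minus the Shannon entropy, so minimizing the divergence from a uniform reference is the same as maximizing entropy.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum p_i log(p_i / q_i) of p from q, in nats.

    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_entropy(p):
    """Shannon entropy -sum p_i log p_i in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # illustrative distribution
uniform = [0.25] * 4            # uniform reference measure q

divergence = kl_divergence(p, uniform)
gap_to_max_entropy = math.log(4) - shannon_entropy(p)
print(divergence, gap_to_max_entropy)   # the two values agree up to rounding
```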
We have some testable information $I$ about a quantity $x$ which takes values in some interval of the real numbers (all integrals below are over this interval). We assume this information has the form of $m$ constraints on the expectations of the functions $f_k$, i.e. we require our probability density function to satisfy the inequality (or purely equality) moment constraints:
$$\int p(x)\, f_k(x) \, dx \geq F_k, \qquad k = 1, \ldots, m,$$
where the $F_k$ are observables. We also require the probability density to integrate to one, which may be viewed as a primitive constraint on the identity function and an observable equal to 1 giving the constraint
$$\int p(x) \, dx = 1.$$
The probability density function with maximum $H_c$ subject to these constraints is:[14]
$$p(x) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)}\, q(x) \exp\left[\lambda_1 f_1(x) + \cdots + \lambda_m f_m(x)\right]$$
with the partition function determined by
$$Z(\lambda_1, \ldots, \lambda_m) = \int q(x) \exp\left[\lambda_1 f_1(x) + \cdots + \lambda_m f_m(x)\right] dx.$$
As in the discrete case, in the case where all moment constraints are equalities, the values of the $\lambda_k$ parameters are determined by the system of nonlinear equations:
$$F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \ldots, \lambda_m).$$
In the case with inequality moment constraints the Lagrange multipliers are determined from the solution of a convex optimization program.[12]
The invariant measure function $q(x)$ can be best understood by supposing that $x$ is known to take values only in the bounded interval $(a, b)$, and that no other information is given. Then the maximum entropy probability density function is
$$p(x) = A \cdot q(x), \qquad a < x < b,$$
where $A$ is a normalization constant. The invariant measure function is actually the prior density function encoding 'lack of relevant information'. It cannot be determined by the principle of maximum entropy, and must be determined by some other logical method, such as the principle of transformation groups or marginalization theory.
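When the only information is the interval itself, the solution is proportional to the invariant measure, so the normalization constant is simply the reciprocal of the integral of $q$ over $(a, b)$. The following sketch computes it with a midpoint rule for two illustrative choices of $q$: a constant, and $q(x) = 1/x$. The interval, the second measure and the helper name `normalize_measure` are assumptions made for this example, not choices taken from the text.

```python
import math

def normalize_measure(q, a, b, steps=100_000):
    """Return A such that A * q(x) integrates to 1 over (a, b), using the midpoint rule."""
    h = (b - a) / steps
    integral = sum(q(a + (i + 0.5) * h) for i in range(steps)) * h
    return 1.0 / integral

a, b = 1.0, 10.0   # illustrative bounded interval

# Constant q: the maximum entropy density is the uniform density 1 / (b - a).
A_flat = normalize_measure(lambda x: 1.0, a, b)

# q(x) = 1/x (an assumed scale-invariant measure): A = 1 / log(b / a).
A_scale = normalize_measure(lambda x: 1.0 / x, a, b)

print(A_flat, 1.0 / (b - a))
print(A_scale, 1.0 / math.log(b / a))
```

The two choices of $q$ give different densities on the same interval, which is why the invariant measure has to be fixed by an argument outside the maximum entropy procedure.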
For several examples of maximum entropy distributions, see the article on maximum entropy probability distributions.
Proponents of the principle of maximum entropy justify its use in assigning probabilities in several ways, including the following two arguments. These arguments take the use of Bayesian probability as given, and are thus subject to the same postulates.
Consider a discrete probability distribution among $m$ mutually exclusive propositions. The most informative distribution would occur when one of the propositions was known to be true. In that case, the information entropy would be equal to zero. The least informative distribution would occur when there is no reason to favor any one of the propositions over the others. In that case, the only reasonable probability distribution would be uniform, and then the information entropy would be equal to its maximum possible value, $\log m$. The information entropy can therefore be seen as a numerical measure which describes how uninformative a particular probability distribution is, ranging from zero (completely informative) to $\log m$ (completely uninformative).
By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess. Thus the maximum entropy distribution is the only reasonable distribution. The dependence of the solution on the dominating measure, represented by $q(x)$ in the continuous case, is however a source of criticisms of the approach, since this dominating measure is in fact arbitrary.[15]
The following argument is the result of a suggestion made by Graham Wallis to E. T. Jaynes in 1962.[16] It is essentially the same mathematical argument used for the Maxwell–Boltzmann statistics in statistical mechanics, although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept. The information entropy function is not assumed a priori, but rather is found in the course of the argument; and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way.
Suppose an individual wishes to make a probability assignment among $m$ mutually exclusive propositions. They have some testable information, but are not sure how to go about including this information in their probability assessment. They therefore conceive of the following random experiment. They will distribute $N$ quanta of probability (each worth $1/N$) at random among the $m$ possibilities. (One might imagine that they will throw $N$ balls into $m$ buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, they will check if the probability assignment thus obtained is consistent with their information. (For this step to be successful, the information must be a constraint given by an open set in the space of probability measures). If it is inconsistent, they will reject it and try again. If it is consistent, their assessment will be
$$p_i = \frac{n_i}{N}$$
where $p_i$ is the probability of the $i$th proposition, while $n_i$ is the number of quanta that were assigned to the $i$th proposition (i.e. the number of balls that ended up in bucket $i$).
Now, in order to reduce the 'graininess' of the probability assignment, it will be necessary to use quite a large number of quanta of probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, the protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the multinomial distribution,
$$\Pr(\mathbf{p}) = W \cdot m^{-N}$$
where
$$W = \frac{N!}{n_1! \, n_2! \cdots n_m!}$$
is sometimes known as the multiplicity of the outcome.
The most probable result is the one which maximizes the multiplicity $W$. Rather than maximizing $W$ directly, the protagonist could equivalently maximize any monotonic increasing function of $W$. They decide to maximize
$$\frac{1}{N} \log W = \frac{1}{N} \log \frac{N!}{n_1! \, n_2! \cdots n_m!} = \frac{1}{N} \log \frac{N!}{(N p_1)! \, (N p_2)! \cdots (N p_m)!} = \frac{1}{N} \left( \log N! - \sum_{i=1}^{m} \log\big((N p_i)!\big) \right).$$
At this point, in order to simplify the expression, the protagonist takes the limit as $N \to \infty$, i.e. as the probability levels go from grainy discrete values to smooth continuous values. Using Stirling's approximation, in which the linear terms cancel because $\sum_i N p_i = N$, they find
$$\begin{aligned}
\lim_{N \to \infty} \left( \frac{1}{N} \log W \right)
&= \lim_{N \to \infty} \frac{1}{N} \left( N \log N - \sum_{i=1}^{m} N p_i \log (N p_i) \right) \\
&= \lim_{N \to \infty} \left( \Big(1 - \sum_{i=1}^{m} p_i\Big) \log N - \sum_{i=1}^{m} p_i \log p_i \right) \\
&= -\sum_{i=1}^{m} p_i \log p_i = H(\mathbf{p}).
\end{aligned}$$
All that remains for the protagonist to do is to maximize entropy under the constraints of their testable information. They have found that the maximum entropy distribution is the most probable of all "fair" random distributions, in the limit as the probability levels go from discrete to continuous.
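The limit in the Wallis argument can be checked numerically. The sketch below evaluates the logarithm of the multinomial multiplicity per quantum for a fixed set of proportions and an increasing number of quanta, and compares it with the Shannon entropy of those proportions. The proportions, the three values of the number of quanta and the helper names are illustrative assumptions; `math.lgamma` is used so that large factorials never have to be formed explicitly.

```python
import math

def log_multiplicity(counts):
    """log W = log(N! / (n_1! ... n_m!)), computed with lgamma to avoid huge factorials."""
    n_total = sum(counts)
    return math.lgamma(n_total + 1) - sum(math.lgamma(n + 1) for n in counts)

def shannon_entropy(p):
    """Shannon entropy -sum p_i log p_i in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.3, 0.2]          # illustrative target proportions p_i = n_i / N
h = shannon_entropy(p)
gaps = []
for scale in (10, 1_000, 100_000):
    counts = [round(pi * scale) for pi in p]   # n_i = N * p_i, whole numbers for these values
    n_total = sum(counts)
    per_quantum = log_multiplicity(counts) / n_total
    gaps.append(h - per_quantum)
    print(n_total, per_quantum, h)
```

The gap between the entropy and the per-quantum log multiplicity shrinks as the number of quanta grows, as the Stirling approximation predicts.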
Giffin and Caticha (2007) state that Bayes' theorem and the principle of maximum entropy are completely compatible and can be seen as special cases of the "method of maximum relative entropy". They state that this method reproduces every aspect of orthodox Bayesian inference methods. In addition this new method opens the door to tackling problems that could not be addressed by either the maximal entropy principle or orthodox Bayesian methods individually. Moreover, recent contributions (Lazar 2003, and Schennach 2005) show that frequentist relative-entropy-based inference approaches (such as empirical likelihood and exponentially tilted empirical likelihood – see e.g. Owen 2001 and Kitamura 2006) can be combined with prior information to perform Bayesian posterior analysis.
Jaynes stated Bayes' theorem was a way to calculate a probability, while maximum entropy was a way to assign a prior probability distribution.[17]
It is, however, possible in concept to solve for a posterior distribution directly from a stated prior distribution using the principle of minimum cross-entropy (the principle of maximum entropy being the special case in which a uniform distribution is the given prior), independently of any Bayesian considerations, by treating the problem formally as a constrained optimisation problem with the entropy functional as the objective function. For the case of given average values as testable information (averaged over the sought-after probability distribution), the sought-after distribution is formally the Gibbs (or Boltzmann) distribution, the parameters of which must be solved for in order to achieve minimum cross-entropy and satisfy the given testable information.
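A minimal sketch of such an update, under stated assumptions, is given below for a discrete variable with one average-value constraint. The prior, the observable values, the target average and the helper names `tilted`, `expectation` and `min_cross_entropy` are all hypothetical. The sketch relies on the standard result that the minimum cross-entropy solution has the form of the prior multiplied by an exponential factor, and it finds the single parameter by bisection, which is just one convenient numerical choice.

```python
import math

def tilted(prior, f, lam):
    """p_i proportional to prior_i * exp(lam * f_i): the Gibbs form relative to a prior."""
    top = max(lam * fi for fi in f)  # shift exponents for numerical stability
    w = [qi * math.exp(lam * fi - top) for qi, fi in zip(prior, f)]
    z = sum(w)
    return [wi / z for wi in w]

def expectation(f, p):
    """Average of the observable f under the distribution p."""
    return sum(fi * pi for fi, pi in zip(f, p))

def min_cross_entropy(prior, f, target, lo=-50.0, hi=50.0):
    """Bisection on lam so that the tilted distribution has E[f] = target."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expectation(f, tilted(prior, f, mid)) < target:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, tilted(prior, f, lam)

f = [1, 2, 3, 4]                 # illustrative observable values
prior = [0.4, 0.3, 0.2, 0.1]     # assumed non-uniform prior, with E[f] = 2.0
lam, posterior = min_cross_entropy(prior, f, target=3.0)
print(lam, posterior, expectation(f, posterior))
```

If the target equals the average already implied by the prior, the parameter is zero and the prior is returned unchanged; with a uniform prior the procedure reduces to ordinary entropy maximization.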
The principle of maximum entropy bears a relation to a key assumption of kinetic theory of gases known as molecular chaos or Stosszahlansatz. This asserts that the distribution function characterizing particles entering a collision can be factorized. Though this statement can be understood as a strictly physical hypothesis, it can also be interpreted as a heuristic hypothesis regarding the most probable configuration of particles before colliding.[18]