Gibbs sampling

From Wikipedia, the free encyclopedia

In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for sampling from a specified multivariate probability distribution when direct sampling from the joint distribution is difficult, but sampling from the conditional distribution is more practical. The resulting sequence of samples can be used to approximate the joint distribution (e.g., to generate a histogram of the distribution); to approximate the marginal distribution of one of the variables, or some subset of the variables (for example, the unknown parameters or latent variables); or to compute an integral (such as the expected value of one of the variables). Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.

Gibbs sampling is commonly used as a means of statistical inference, especially Bayesian inference. It is a randomized algorithm (i.e. an algorithm that makes use of random numbers), and is an alternative to deterministic algorithms for statistical inference such as the expectation–maximization algorithm (EM).

As with other MCMC algorithms, Gibbs sampling generates a Markov chain of samples, each of which is correlated with nearby samples. As a result, care must be taken if independent samples are desired. Samples from the beginning of the chain (the burn-in period) may not accurately represent the desired distribution and are usually discarded.

Introduction


Gibbs sampling is named after the physicist Josiah Willard Gibbs, in reference to an analogy between the sampling algorithm and statistical physics. The algorithm was described by brothers Stuart and Donald Geman in 1984, some eight decades after the death of Gibbs,[1] and became popularized in the statistics community for calculating marginal probability distributions, especially the posterior distribution.[2]

In its basic version, Gibbs sampling is a special case of the Metropolis–Hastings algorithm. However, in its extended versions (see below), it can be considered a general framework for sampling from a large set of variables by sampling each variable (or in some cases, each group of variables) in turn, and can incorporate the Metropolis–Hastings algorithm (or methods such as slice sampling) to implement one or more of the sampling steps.

Gibbs sampling is applicable when the joint distribution is not known explicitly or is difficult to sample from directly, but the conditional distribution of each variable is known and is easy (or at least, easier) to sample from. The Gibbs sampling algorithm generates an instance from the distribution of each variable in turn, conditional on the current values of the other variables. It can be shown that the sequence of samples constitutes a Markov chain, and the stationary distribution of that Markov chain is just the sought-after joint distribution.[3]

Gibbs sampling is particularly well-adapted to sampling the posterior distribution of a Bayesian network, since Bayesian networks are typically specified as a collection of conditional distributions.

Implementation


Gibbs sampling, in its basic incarnation, is a special case of the Metropolis–Hastings algorithm. The point of Gibbs sampling is that given a multivariate distribution it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution. Suppose we want to obtain $k$ samples of an $n$-dimensional random vector $\mathbf{X} = (X_1, \dots, X_n)$, and denote the $i$-th sample by $\mathbf{X}^{(i)} = (x_1^{(i)}, \dots, x_n^{(i)})$. We proceed iteratively:

  1. Begin with some initial value $\mathbf{X}^{(1)}$.
  2. To obtain the next sample $\mathbf{X}^{(i+1)}$, sample each component $x_j^{(i+1)}$ in turn from its conditional distribution given the most recent values of all the other components, i.e. from $p\left(x_j \mid x_1^{(i+1)}, \dots, x_{j-1}^{(i+1)}, x_{j+1}^{(i)}, \dots, x_n^{(i)}\right)$.
  3. Repeat the above step $k$ times.
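To make the iteration concrete, here is a minimal sketch in Python (not part of the original article) for a zero-mean bivariate normal with correlation $\rho$, a standard textbook case in which both full conditionals are univariate normal distributions:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, burn_in=1000, seed=0):
    """Systematic-scan Gibbs sampler for a zero-mean bivariate normal
    with unit marginal variances and correlation rho."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                    # arbitrary starting point
    cond_sd = np.sqrt(1.0 - rho ** 2)    # sd of each full conditional
    samples = np.empty((n_samples, 2))
    for s in range(burn_in + n_samples):
        # Each full conditional of a bivariate normal is itself normal:
        # X1 | X2 = x2  ~  N(rho * x2, 1 - rho^2), and symmetrically for X2.
        x1 = rng.normal(rho * x2, cond_sd)
        x2 = rng.normal(rho * x1, cond_sd)
        if s >= burn_in:
            samples[s - burn_in] = (x1, x2)
    return samples

draws = gibbs_bivariate_normal(rho=0.8, n_samples=10_000)
print(draws.mean(axis=0))            # both means ≈ 0
print(np.corrcoef(draws.T)[0, 1])    # ≈ 0.8
```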

Properties


If such sampling is performed, these important facts hold:

  • The samples approximate the joint distribution of all variables.
  • The marginal distribution of any subset of variables can be approximated by simply considering the samples for that subset of variables, ignoring the rest.
  • The expected value of any variable can be approximated by averaging over all the samples.

When performing the sampling:

  • The initial values of the variables can be determined randomly or by some other algorithm such as expectation–maximization.
  • It is not actually necessary to determine an initial value for the first variable sampled.
  • It is common to ignore some number of samples at the beginning (the so-called burn-in period), and then consider only every $n$-th sample when averaging values to compute an expectation. For example, the first 1,000 samples might be ignored, and then every 100th sample averaged, throwing away all the rest (a small sketch follows this list). The reason for this is that (1) the stationary distribution of the Markov chain is the desired joint distribution over the variables, but it may take a while for that stationary distribution to be reached; (2) successive samples are not independent of each other but form a Markov chain with some amount of correlation. Sometimes, algorithms can be used to determine the amount of autocorrelation between samples and the value of $n$ (the period between samples that are actually used) computed from this, but in practice there is a fair amount of "black magic" involved.[citation needed]
  • The process of simulated annealing is often used to reduce the "random walk" behavior in the early part of the sampling process (i.e. the tendency to move slowly around the sample space, with a high amount of autocorrelation between samples, rather than moving around quickly, as is desired). Other techniques that may reduce autocorrelation are collapsed Gibbs sampling, blocked Gibbs sampling, and ordered overrelaxation; see below.
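The burn-in and thinning recipe above can be sketched in a few lines (a hedged illustration assuming the chain is already stored as a NumPy array, with the burn-in and thinning values taken from the example in the list):

```python
import numpy as np

def thin_chain(chain, burn_in=1000, thin=100):
    """Discard the burn-in draws, then keep every `thin`-th sample.
    `chain` has shape (n_draws, ...) as produced by a Gibbs sampler."""
    return chain[burn_in::thin]

def autocorr(chain, lag):
    """Rough empirical autocorrelation of a 1-D chain at a given lag;
    one way to gauge how large `thin` needs to be."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)
```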

Relation of conditional distribution and joint distribution


Furthermore, the conditional distribution of one variable given all others is proportional to the joint distribution, i.e., for all possible values $(x_i)_{1 \leq i \leq n}$ of $\mathbf{X}$:

$$P(X_j = x_j \mid (X_i = x_i)_{i \neq j}) = \frac{P((X_i = x_i)_i)}{P((X_i = x_i)_{i \neq j})} \propto P((X_i = x_i)_i)$$

"Proportional to" in this case means that the denominator is not a function ofxj{\displaystyle x_{j}} and thus is the same for all values ofxj{\displaystyle x_{j}}; it forms part of thenormalization constant for the distribution overxj{\displaystyle x_{j}}. In practice, to determine the nature of the conditional distribution of a factorxj{\displaystyle x_{j}}, it is easiest to factor the joint distribution according to the individual conditional distributions defined by thegraphical model over the variables, ignore all factors that are not functions ofxj{\displaystyle x_{j}} (all of which, together with the denominator above, constitute the normalization constant), and then reinstate the normalization constant at the end, as necessary. In practice, this means doing one of three things:

  1. If the distribution is discrete, the individual probabilities of all possible values of $x_j$ are computed, and then summed to find the normalization constant (see the sketch after this list).
  2. If the distribution is continuous and of a known form, the normalization constant will also be known.
  3. In other cases, the normalization constant can usually be ignored, as most sampling methods do not require it.
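For the discrete case (item 1 above), a minimal sketch of this recipe, assuming a user-supplied, possibly unnormalized `joint(x)` function and a finite set of values for the coordinate being resampled (both hypothetical names):

```python
import numpy as np

def sample_full_conditional(joint, x, j, values, rng):
    """Draw a new value for x[j] from its full conditional, obtained by
    evaluating the (possibly unnormalized) joint at each candidate value
    and normalizing; factors not involving x[j] cancel in the division."""
    x = list(x)
    weights = np.array([joint(x[:j] + [v] + x[j + 1:]) for v in values],
                       dtype=float)
    probs = weights / weights.sum()          # reinstate normalization constant
    return values[rng.choice(len(values), p=probs)]
```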

Inference


Gibbs sampling is commonly used for statistical inference (e.g. determining the best value of a parameter, such as determining the number of people likely to shop at a particular store on a given day, the candidate a voter will most likely vote for, etc.). The idea is that observed data is incorporated into the sampling process by creating separate variables for each piece of observed data and fixing the variables in question to their observed values, rather than sampling from those variables. The distribution of the remaining variables is then effectively a posterior distribution conditioned on the observed data.

The most likely value of a desired parameter (the mode) could then simply be selected by choosing the sample value that occurs most commonly; this is essentially equivalent to maximum a posteriori estimation of a parameter. (Since the parameters are usually continuous, it is often necessary to "bin" the sampled values into a finite number of ranges in order to get a meaningful estimate of the mode.) More commonly, however, the expected value (mean or average) of the sampled values is chosen; this is a Bayes estimator that takes advantage of the additional data about the entire distribution that is available from Bayesian sampling, whereas a maximization algorithm such as expectation maximization (EM) is capable of returning only a single point from the distribution. For example, for a unimodal distribution the mean (expected value) is usually similar to the mode (most common value), but if the distribution is skewed in one direction, the mean will be moved in that direction, which effectively accounts for the extra probability mass in that direction. (If a distribution is multimodal, the expected value may not return a meaningful point, and any of the modes is typically a better choice.)

Although some of the variables typically correspond to parameters of interest, others are uninteresting ("nuisance") variables introduced into the model to properly express the relationships among variables. Although the sampled values represent the joint distribution over all variables, the nuisance variables can simply be ignored when computing expected values or modes; this is equivalent to marginalizing over the nuisance variables. When a value for multiple variables is desired, the expected value is simply computed over each variable separately. (When computing the mode, however, all variables must be considered together.)

Supervised learning, unsupervised learning and semi-supervised learning (aka learning with missing values) can all be handled by simply fixing the values of all variables whose values are known, and sampling from the remainder.

For observed data, there will be one variable for each observation—rather than, for example, one variable corresponding to the sample mean or sample variance of a set of observations. In fact, there generally will be no variables at all corresponding to concepts such as "sample mean" or "sample variance". Instead, in such a case there will be variables representing the unknown true mean and true variance, and the determination of sample values for these variables results automatically from the operation of the Gibbs sampler.

Generalized linear models (i.e. variations of linear regression) can sometimes be handled by Gibbs sampling as well. For example, probit regression for determining the probability of a given binary (yes/no) choice, with normally distributed priors placed over the regression coefficients, can be implemented with Gibbs sampling because it is possible to add additional variables and take advantage of conjugacy. However, logistic regression cannot be handled this way. One possibility is to approximate the logistic function with a mixture (typically 7–9 components) of normal distributions. More commonly, however, Metropolis–Hastings is used instead of Gibbs sampling.
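A hedged sketch of how the probit case can work via latent-variable augmentation, in the spirit of Albert and Chib's approach; the N(0, 100·I) prior and all function names here are illustrative assumptions, not from the article:

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=2000, seed=0):
    """Gibbs sampler for probit regression via latent-variable augmentation.
    Assumes a N(0, 100 I) prior on the coefficients; X is an (n, p) NumPy
    array and y is a 0/1 NumPy vector of length n."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    prior_prec = np.eye(p) / 100.0
    B = np.linalg.inv(prior_prec + X.T @ X)   # posterior covariance of beta
    L = np.linalg.cholesky(B)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for s in range(n_iter):
        mu = X @ beta
        # Latent z_i ~ N(mu_i, 1), truncated to (0, inf) if y_i = 1
        # and to (-inf, 0] if y_i = 0 (bounds standardized around mu).
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # beta | z is multivariate normal, by normal-normal conjugacy.
        m = B @ (X.T @ z)
        beta = m + L @ rng.standard_normal(p)
        draws[s] = beta
    return draws
```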

Mathematical background


Suppose that a sample $X$ is taken from a distribution depending on a parameter vector $\theta \in \Theta$ of length $d$, with prior distribution $g(\theta_1, \ldots, \theta_d)$. It may be that $d$ is very large and that numerical integration to find the marginal densities of the $\theta_i$ would be computationally expensive. Then an alternative method of calculating the marginal densities is to create a Markov chain on the space $\Theta$ by repeating these two steps:

  1. Pick a random index $1 \leq j \leq d$
  2. Pick a new value for $\theta_j$ according to $g(\theta_1, \ldots, \theta_{j-1}, \,\cdot\,, \theta_{j+1}, \ldots, \theta_d)$

These steps define a reversible Markov chain with the desired invariant distribution $g$. This can be proved as follows. Define $x \sim_j y$ if $x_i = y_i$ for all $i \neq j$, and let $p_{xy}$ denote the probability of a jump from $x \in \Theta$ to $y \in \Theta$. Then, the transition probabilities are

$$p_{xy} = \begin{cases} \dfrac{1}{d} \dfrac{g(y)}{\sum_{z \in \Theta : z \sim_j x} g(z)} & x \sim_j y \\ 0 & \text{otherwise} \end{cases}$$

So

$$g(x)\, p_{xy} = \frac{1}{d} \frac{g(x)\, g(y)}{\sum_{z \in \Theta : z \sim_j x} g(z)} = \frac{1}{d} \frac{g(y)\, g(x)}{\sum_{z \in \Theta : z \sim_j y} g(z)} = g(y)\, p_{yx}$$

since $x \sim_j y$ is an equivalence relation. Thus the detailed balance equations are satisfied, implying the chain is reversible and it has invariant distribution $g$.

In practice, the index $j$ is not chosen at random, and the chain cycles through the indices in order. In general this gives a non-stationary Markov process, but each individual step will still be reversible, and the overall process will still have the desired stationary distribution (as long as the chain can access all states under the fixed ordering).
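The difference between the two scan orders can be sketched as follows, assuming a hypothetical helper `sample_conditional(j, theta)` that draws from the $j$-th full conditional $g(\cdot \mid \theta_{-j})$:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_scan_step(theta, sample_conditional):
    """One step of the random-scan sampler analyzed above:
    a single, uniformly chosen coordinate is refreshed."""
    j = rng.integers(len(theta))
    theta[j] = sample_conditional(j, theta)
    return theta

def systematic_scan_cycle(theta, sample_conditional):
    """One cycle of the systematic-scan variant used in practice:
    every coordinate is refreshed once, in a fixed order."""
    for j in range(len(theta)):
        theta[j] = sample_conditional(j, theta)
    return theta
```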

Gibbs sampler in Bayesian inference and its relation to information theory


Let $y$ denote observations generated from the sampling distribution $f(y|\theta)$ and let $\pi(\theta)$ be a prior supported on the parameter space $\Theta$. Then one of the central goals of Bayesian statistics is to approximate the posterior density

$$\pi(\theta|y) = \frac{f(y|\theta)\, \pi(\theta)}{m(y)}$$

where the marginal likelihood $m(y) = \int_\Theta f(y|\theta)\, \pi(\theta)\, d\theta$ is assumed to be finite for all $y$.

To explain the Gibbs sampler, we additionally assume that the parameter space $\Theta$ is decomposed as

$$\Theta = \prod_{i=1}^{K} \Theta_i = \Theta_1 \times \cdots \times \Theta_i \times \cdots \times \Theta_K, \quad (K > 1),$$

where $\times$ represents the Cartesian product. Each component parameter space $\Theta_i$ can be a set of scalar components, subvectors, or matrices.

Define $\Theta_{-i}$ as the set that complements $\Theta_i$ (that is, the product of all the component spaces other than the $i$-th). The essential ingredients of the Gibbs sampler are the $i$-th full conditional posterior distributions for each $i = 1, \ldots, K$:

$$\pi(\theta_i | \theta_{-i}, y) = \pi(\theta_i | \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_K, y).$$
[Figure: a pictorial description of the Gibbs sampling algorithm.[4]]
[Figure: schematic description of the information equality associated with the Gibbs sampler at the $i$-th step within a cycle.[4]]

The following algorithm details a generic Gibbs sampler:

Initialize: pick an arbitrary starting value $\theta^{(1)} = (\theta_1^{(1)}, \theta_2^{(1)}, \ldots, \theta_i^{(1)}, \theta_{i+1}^{(1)}, \ldots, \theta_K^{(1)})$

Iterate a cycle:

  Step 1. Draw $\theta_1^{(s+1)} \sim \pi(\theta_1 | \theta_2^{(s)}, \theta_3^{(s)}, \ldots, \theta_K^{(s)}, y)$

  Step 2. Draw $\theta_2^{(s+1)} \sim \pi(\theta_2 | \theta_1^{(s+1)}, \theta_3^{(s)}, \ldots, \theta_K^{(s)}, y)$

  $\vdots$

  Step i. Draw $\theta_i^{(s+1)} \sim \pi(\theta_i | \theta_1^{(s+1)}, \theta_2^{(s+1)}, \ldots, \theta_{i-1}^{(s+1)}, \theta_{i+1}^{(s)}, \ldots, \theta_K^{(s)}, y)$

  Step i+1. Draw $\theta_{i+1}^{(s+1)} \sim \pi(\theta_{i+1} | \theta_1^{(s+1)}, \theta_2^{(s+1)}, \ldots, \theta_i^{(s+1)}, \theta_{i+2}^{(s)}, \ldots, \theta_K^{(s)}, y)$

  $\vdots$

  Step K. Draw $\theta_K^{(s+1)} \sim \pi(\theta_K | \theta_1^{(s+1)}, \theta_2^{(s+1)}, \ldots, \theta_{K-1}^{(s+1)}, y)$

End iterate

Note that the Gibbs sampler operates by an iterative Monte Carlo scheme within each cycle. The $S$ samples $\{\theta^{(s)}\}_{s=1}^{S}$ drawn by the above algorithm form a Markov chain whose invariant distribution is the target density $\pi(\theta|y)$.
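As a concrete sketch of such a cycle with $K = 2$ blocks, consider the textbook conjugate normal model $y_i \sim N(\mu, \sigma^2)$ with priors $\mu \sim N(\mu_0, \tau_0^2)$ and $\sigma^2 \sim \text{Inv-Gamma}(a, b)$; the hyperparameter names and values below are illustrative assumptions, not from the article:

```python
import numpy as np

def gibbs_normal_model(y, mu0=0.0, tau0_sq=100.0, a=2.0, b=2.0,
                       n_iter=5000, seed=0):
    """Two-block Gibbs sampler for y_i ~ N(mu, sigma^2) with conjugate
    priors mu ~ N(mu0, tau0_sq) and sigma^2 ~ Inv-Gamma(a, b)."""
    rng = np.random.default_rng(seed)
    n, ybar = len(y), np.mean(y)
    mu, sigma_sq = ybar, np.var(y)        # reasonable starting values
    draws = np.empty((n_iter, 2))
    for s in range(n_iter):
        # Step 1: mu | sigma^2, y is normal (normal-normal conjugacy).
        v = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
        m = v * (mu0 / tau0_sq + n * ybar / sigma_sq)
        mu = rng.normal(m, np.sqrt(v))
        # Step 2: sigma^2 | mu, y is inverse-gamma (conjugacy);
        # draw a gamma variate and invert it.
        shape = a + 0.5 * n
        rate = b + 0.5 * np.sum((y - mu) ** 2)
        sigma_sq = 1.0 / rng.gamma(shape, 1.0 / rate)
        draws[s] = (mu, sigma_sq)
    return draws

y = np.random.default_rng(1).normal(3.0, 2.0, size=200)   # synthetic data
draws = gibbs_normal_model(y)
print(draws[1000:].mean(axis=0))   # posterior means ≈ (3, 4) after burn-in
```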

Now, for each $i = 1, \ldots, K$, define the following information-theoretic quantities:

$$I(\theta_i; \theta_{-i}) = \mathrm{KL}\big(\pi(\theta|y)\, \big\|\, \pi(\theta_i|y)\, \pi(\theta_{-i}|y)\big) = \int_\Theta \pi(\theta|y) \log\left(\frac{\pi(\theta|y)}{\pi(\theta_i|y)\, \pi(\theta_{-i}|y)}\right) d\theta,$$

$$H(\theta_{-i}) = -\int_{\Theta_{-i}} \pi(\theta_{-i}|y) \log \pi(\theta_{-i}|y)\, d\theta_{-i},$$

$$H(\theta_{-i}|\theta_i) = -\int_\Theta \pi(\theta|y) \log \pi(\theta_{-i}|\theta_i, y)\, d\theta,$$

namely, the posterior mutual information, posterior differential entropy, and posterior conditional differential entropy, respectively. We can similarly define the information-theoretic quantities $I(\theta_{-i}; \theta_i)$, $H(\theta_i)$, and $H(\theta_i|\theta_{-i})$ by interchanging $i$ and $-i$ in the defined quantities. Then, the following $K$ equations hold.[4]

$$I(\theta_i; \theta_{-i}) = H(\theta_{-i}) - H(\theta_{-i}|\theta_i) = H(\theta_i) - H(\theta_i|\theta_{-i}) = I(\theta_{-i}; \theta_i), \quad (i = 1, \ldots, K).$$

The mutual information $I(\theta_i; \theta_{-i})$ quantifies the reduction in uncertainty about the random quantity $\theta_i$ once we know $\theta_{-i}$, a posteriori. It vanishes if and only if $\theta_i$ and $\theta_{-i}$ are marginally independent a posteriori. The mutual information $I(\theta_i; \theta_{-i})$ can be interpreted as the quantity that is transmitted from the $i$-th step to the $(i+1)$-th step within a single cycle of the Gibbs sampler.

Variations and extensions


Numerous variations of the basic Gibbs sampler exist. The goal of these variations is to reduce the autocorrelation between samples sufficiently to overcome any added computational costs.

Blocked Gibbs sampler

A blocked Gibbs sampler groups two or more variables together and samples from their joint distribution conditioned on all other variables, rather than sampling from each one individually. For example, in a hidden Markov model, a blocked Gibbs sampler might sample all the latent variables making up the Markov chain in one go, using the forward-backward algorithm.

Collapsed Gibbs sampler

A collapsed Gibbs sampler integrates out (marginalizes over) one or more variables when sampling for some other variable. For example, imagine that a model consists of three variables A, B, and C. A simple Gibbs sampler would sample from p(A | B, C), then p(B | A, C), then p(C | A, B). A collapsed Gibbs sampler might replace the sampling step for A with a sample taken from the marginal distribution p(A | C), with variable B integrated out in this case. Alternatively, variable B could be collapsed out entirely, alternately sampling from p(A | C) and p(C | A) and not sampling over B at all. The distribution over a variable A that arises when collapsing a parent variable B is called a compound distribution; sampling from this distribution is generally tractable when B is the conjugate prior for A, particularly when A and B are members of the exponential family. For more information, see the article on compound distributions or Liu (1994).[5]

Implementing a collapsed Gibbs sampler

Collapsing Dirichlet distributions

In hierarchical Bayesian models with categorical variables, such as latent Dirichlet allocation and various other models used in natural language processing, it is quite common to collapse out the Dirichlet distributions that are typically used as prior distributions over the categorical variables. The result of this collapsing introduces dependencies among all the categorical variables dependent on a given Dirichlet prior, and the joint distribution of these variables after collapsing is a Dirichlet-multinomial distribution. The conditional distribution of a given categorical variable in this distribution, conditioned on the others, assumes an extremely simple form that makes Gibbs sampling even easier than if the collapsing had not been done. The rules are as follows:

  1. Collapsing out a Dirichlet prior node affects only the parent and children nodes of the prior. Since the parent is often a constant, it is typically only the children that we need to worry about.
  2. Collapsing out a Dirichlet prior introduces dependencies among all the categorical children dependent on that prior — but no extra dependencies among any other categorical children. (This is important to keep in mind, for example, when there are multiple Dirichlet priors related by the same hyperprior. Each Dirichlet prior can be independently collapsed and affects only its direct children.)
  3. After collapsing, the conditional distribution of one dependent child given the others assumes a very simple form: the probability of seeing a given value is proportional to the sum of the corresponding hyperprior for this value and the count of all of the other dependent nodes assuming the same value (a small sketch follows this list). Nodes not dependent on the same prior must not be counted. The same rule applies in other iterative inference methods, such as variational Bayes or expectation maximization; however, if the method involves keeping partial counts, then the partial counts for the value in question must be summed across all the other dependent nodes. Sometimes this summed-up partial count is termed the expected count or similar. The probability is proportional to the resulting value; the actual probability must be determined by normalizing across all the possible values that the categorical variable can take (i.e. adding up the computed result for each possible value of the categorical variable, and dividing all the computed results by this sum).
  4. If a given categorical node has dependent children (e.g. when it is a latent variable in a mixture model), the value computed in the previous step (expected count plus prior, or whatever is computed) must be multiplied by the actual conditional probabilities (not a computed value that is proportional to the probability!) of all children given their parents. See the article on the Dirichlet-multinomial distribution for a detailed discussion.
  5. In the case where the group membership of the nodes dependent on a given Dirichlet prior may change dynamically depending on some other variable (e.g. a categorical variable indexed by another latent categorical variable, as in a topic model), the same expected counts are still computed, but need to be done carefully so that the correct set of variables is included. See the article on the Dirichlet-multinomial distribution for more discussion, including in the context of a topic model.
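Rule 3 can be sketched in a few lines (a hedged illustration with hypothetical names; `counts` excludes the variable currently being resampled):

```python
import numpy as np

def resample_collapsed_categorical(counts, alpha, rng):
    """Draw one categorical variable with its Dirichlet prior collapsed out.
    counts[k] = number of OTHER variables sharing the prior with value k;
    alpha[k]  = the Dirichlet hyperparameter for value k."""
    weights = counts + alpha          # rule 3: hyperprior + counts of others
    probs = weights / weights.sum()   # normalize over all possible values
    return rng.choice(len(probs), p=probs)
```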
Collapsing other conjugate priors

In general, any conjugate prior can be collapsed out, if its only children have distributions conjugate to it. The relevant math is discussed in the article on compound distributions. If there is only one child node, the result will often assume a known distribution. For example, collapsing an inverse-gamma-distributed variance out of a network with a single Gaussian child will yield a Student's t-distribution. (For that matter, collapsing both the mean and variance of a single Gaussian child will still yield a Student's t-distribution, provided both are conjugate, i.e. Gaussian mean, inverse-gamma variance.)

If there are multiple child nodes, they will all become dependent, as in the Dirichlet-categorical case. The resulting joint distribution will have a closed form that resembles in some ways the compound distribution, although it will have a product of a number of factors, one for each child node, in it.

In addition, and most importantly, the resulting conditional distribution of one of the child nodes given the others (and also given the parents of the collapsed node(s), but not given the children of the child nodes) will have the same density as the posterior predictive distribution of all the remaining child nodes. Furthermore, the posterior predictive distribution has the same density as the basic compound distribution of a single node, although with different parameters. The general formula is given in the article on compound distributions.

For example, given a Bayes network with a set of conditionally independent, identically distributed Gaussian-distributed nodes with conjugate prior distributions placed on the mean and variance, the conditional distribution of one node given the others after compounding out both the mean and variance will be a Student's t-distribution. Similarly, the result of compounding out the gamma prior of a number of Poisson-distributed nodes causes the conditional distribution of one node given the others to assume a negative binomial distribution.
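As a hedged sketch of the Poisson-gamma case just mentioned (the hyperparameters and counts below are illustrative assumptions), the collapsed conditional of one count given the others is the negative binomial posterior predictive:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: counts y[0..m-1] share a collapsed Gamma(a, rate=b)
# prior on their common Poisson rate.
a, b = 2.0, 1.0
y = np.array([3, 5, 2, 4, 6])

def draw_collapsed_poisson(j):
    """Draw y[j] from its conditional given the other counts, with the
    gamma prior integrated out (a negative binomial distribution)."""
    others = np.delete(y, j)
    r = a + others.sum()                              # posterior shape
    p = (b + len(others)) / (b + len(others) + 1.0)   # NB success probability
    return rng.negative_binomial(r, p)

y[0] = draw_collapsed_poisson(0)   # one collapsed Gibbs update
```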

In these cases where compounding produces a well-known distribution, efficient sampling procedures often exist, and using them will often (although not necessarily) be more efficient than not collapsing, and instead sampling both prior and child nodes separately. However, in the case where the compound distribution is not well-known, it may not be easy to sample from, since it generally will not belong to the exponential family and typically will not be log-concave (which would make it easy to sample using adaptive rejection sampling, since a closed form always exists).

In the case where the child nodes of the collapsed nodes themselves have children, the conditional distribution of one of these child nodes given all other nodes in the graph will have to take into account the distribution of these second-level children. In particular, the resulting conditional distribution will be proportional to a product of the compound distribution as defined above, and the conditional distributions of all of the child nodes given their parents (but not given their own children). This follows from the fact that the full conditional distribution is proportional to the joint distribution. If the child nodes of the collapsed nodes are continuous, this distribution will generally not be of a known form, and may well be difficult to sample from despite the fact that a closed form can be written, for the same reasons as described above for non-well-known compound distributions. However, in the particular case that the child nodes are discrete, sampling is feasible, regardless of whether the children of these child nodes are continuous or discrete. In fact, the principle involved here is described in fair detail in the article on the Dirichlet-multinomial distribution.

Gibbs sampler with ordered overrelaxation

A Gibbs sampler with ordered overrelaxation samples a given odd number of candidate values for a variable at each step, sorts them together with the single retained current value, and then selects the candidate whose rank in the sorted list is opposite the rank of the current value. This induces negative correlation between successive samples and can suppress the random-walk behavior of the basic sampler.[6]

Other extensions


It is also possible to extend Gibbs sampling in various ways. For example, in the case of variables whose conditional distribution is not easy to sample from, a single iteration of slice sampling or the Metropolis–Hastings algorithm can be used to sample from the variables in question. It is also possible to incorporate variables that are not random variables, but whose value is deterministically computed from other variables. Generalized linear models, e.g. logistic regression (aka "maximum entropy models"), can be incorporated in this fashion. (BUGS, for example, allows this type of mixing of models.)

Failure modes


There are two ways that Gibbs sampling can fail. The first is when there are islands of high-probability states, with no paths between them. For example, consider a probability distribution over 2-bit vectors, where the vectors (0,0) and (1,1) each have probability 1/2, but the other two vectors (0,1) and (1,0) have probability zero. Gibbs sampling will become trapped in one of the two high-probability vectors, and will never reach the other one. More generally, for any distribution over high-dimensional, real-valued vectors, if two particular elements of the vector are perfectly correlated (or perfectly anti-correlated), those two elements will become stuck, and Gibbs sampling will never be able to change them.
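The 2-bit example can be simulated directly; the following minimal sketch shows the trap (since both full conditionals are degenerate here, each update deterministically copies the other bit, and the chain never leaves its starting island):

```python
import numpy as np

# Joint distribution over 2-bit vectors: P(0,0) = P(1,1) = 1/2 and
# P(0,1) = P(1,0) = 0.  Each full conditional is degenerate:
# P(x1 = b | x2 = b) = 1, so a Gibbs update can only copy the other bit.
x = np.array([1, 1])      # start in the (1,1) island
for _ in range(10_000):
    x[0] = x[1]           # sampling x1 | x2 always copies x2
    x[1] = x[0]           # sampling x2 | x1 always copies x1
print(x)                  # still [1, 1]: the (0,0) island is never reached
```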

The second problem can happen even when all states have nonzero probability and there is only a single island of high-probability states. For example, consider a probability distribution over 100-bit vectors, where the all-zeros vector occurs with probability 1/2, and all other vectors are equally probable, and so have a probability of $\frac{1}{2(2^{100}-1)}$ each. If you want to estimate the probability of the zero vector, it would be sufficient to take 100 or 1,000 samples from the true distribution. That would very likely give an answer very close to 1/2. But you would probably have to take more than $2^{100}$ samples from Gibbs sampling to get the same result. No computer could do this in a lifetime.

This problem occurs no matter how long the burn-in period is. This is because in the true distribution, the zero vector occurs half the time, and those occurrences are randomly mixed in with the nonzero vectors. Even a small sample will see both zero and nonzero vectors. But Gibbs sampling will alternate between returning only the zero vector for long periods (about $2^{99}$ in a row), then only nonzero vectors for long periods (about $2^{99}$ in a row). Thus convergence to the true distribution is extremely slow, requiring much more than $2^{99}$ steps; taking this many steps is not computationally feasible in a reasonable time period. The slow convergence here can be seen as a consequence of the curse of dimensionality. A problem like this can be solved by block sampling the entire 100-bit vector at once. (This assumes that the 100-bit vector is part of a larger set of variables. If this vector is the only thing being sampled, then block sampling is equivalent to not doing Gibbs sampling at all, which by hypothesis would be difficult.)

Software

  • JAGS (Just another Gibbs sampler) is a GPL program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo.
  • Church is free software for performing Gibbs inference over arbitrary distributions that are specified as probabilistic programs.

Notes

  1. ^ Geman, S.; Geman, D. (1984). "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images". IEEE Transactions on Pattern Analysis and Machine Intelligence. 6 (6): 721–741. Bibcode:1984ITPAM...6..721G. doi:10.1109/TPAMI.1984.4767596. PMID 22499653.
  2. ^ Gelfand, Alan E.; Smith, Adrian F. M. (1990-06-01). "Sampling-Based Approaches to Calculating Marginal Densities". Journal of the American Statistical Association. 85 (410): 398–409. doi:10.1080/01621459.1990.10476213. ISSN 0162-1459.
  3. ^ Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B. (2014). Bayesian Data Analysis. Vol. 2. Boca Raton, FL: CRC Press.
  4. ^ a b c Lee, Se Yoon (2021). "Gibbs sampler and coordinate ascent variational inference: A set-theoretical review". Communications in Statistics - Theory and Methods. 51 (6): 1549–1568. arXiv:2008.01006. doi:10.1080/03610926.2021.1921214. S2CID 220935477.
  5. ^ Liu, Jun S. (September 1994). "The Collapsed Gibbs Sampler in Bayesian Computations with Applications to a Gene Regulation Problem". Journal of the American Statistical Association. 89 (427): 958–966. doi:10.2307/2290921. JSTOR 2290921.
  6. ^ Neal, Radford M. (1995). Suppressing Random Walks in Markov Chain Monte Carlo Using Ordered Overrelaxation (Technical report). University of Toronto, Department of Statistics. arXiv:bayes-an/9506004. Bibcode:1995bayes.an..6004N.
