Movatterモバイル変換


[0]ホーム

URL:


Jump to content
WikipediaThe Free Encyclopedia
Search

Variational Bayesian methods

From Wikipedia, the free encyclopedia
Mathematical methods used in Bayesian inference and machine learning
For the method of approximation in quantum mechanics, seeVariational method (quantum mechanics).
Part of a series on
Bayesian statistics
Posterior =Likelihood ×Prior ÷Evidence
Background
Model building
Posterior approximation
Estimators
Evidence approximation
Model evaluation

Variational Bayesian methods are a family of techniques for approximating intractableintegrals arising inBayesian inference andmachine learning. They are typically used in complexstatistical models consisting of observed variables (usually termed "data") as well as unknownparameters andlatent variables, with various sorts of relationships among the three types ofrandom variables, as might be described by agraphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:

  1. To provide an analytical approximation to theposterior probability of the unobserved variables, in order to dostatistical inference over these variables.
  2. To derive alower bound for themarginal likelihood (sometimes called theevidence) of the observed data (i.e. themarginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performingmodel selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also theBayes factor article.)

In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative toMonte Carlo sampling methods—particularly,Markov chain Monte Carlo methods such asGibbs sampling—for taking a fully Bayesian approach tostatistical inference over complexdistributions that are difficult to evaluate directly orsample. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally-optimal, exact analytical solution to an approximation of the posterior.

Variational Bayes can be seen as an extension of theexpectation–maximization (EM) algorithm frommaximum likelihood (ML) ormaximum a posteriori (MAP) estimation of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entireposterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.

For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to update the parameters iteratively often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.

Mathematical derivation

[edit]

Problem

[edit]

Invariational inference, the posterior distribution over a set of unobserved variablesZ={Z1Zn}{\displaystyle \mathbf {Z} =\{Z_{1}\dots Z_{n}\}} given some dataX{\displaystyle \mathbf {X} } is approximated by a so-calledvariational distribution,Q(Z):{\displaystyle Q(\mathbf {Z} ):}

P(ZX)Q(Z).{\displaystyle P(\mathbf {Z} \mid \mathbf {X} )\approx Q(\mathbf {Z} ).}

The distributionQ(Z){\displaystyle Q(\mathbf {Z} )} is restricted to belong to a family of distributions of simpler form thanP(ZX){\displaystyle P(\mathbf {Z} \mid \mathbf {X} )} (e.g. a family of Gaussian distributions), selected with the intention of makingQ(Z){\displaystyle Q(\mathbf {Z} )} similar to the true posterior,P(ZX){\displaystyle P(\mathbf {Z} \mid \mathbf {X} )}.

The similarity (or dissimilarity) is measured in terms of a dissimilarity functiond(Q;P){\displaystyle d(Q;P)} and hence inference is performed by selecting the distributionQ(Z){\displaystyle Q(\mathbf {Z} )} that minimizesd(Q;P){\displaystyle d(Q;P)}.

KL divergence

[edit]

The most common type of variational Bayes uses theKullback–Leibler divergence (KL-divergence) ofQ fromP as the choice of dissimilarity function. This choice makes this minimization tractable. The KL-divergence is defined as

DKL(QP)ZQ(Z)logQ(Z)P(ZX).{\displaystyle D_{\mathrm {KL} }(Q\parallel P)\triangleq \sum _{\mathbf {Z} }Q(\mathbf {Z} )\log {\frac {Q(\mathbf {Z} )}{P(\mathbf {Z} \mid \mathbf {X} )}}.}

Note thatQ andP are reversed from what one might expect. This use of reversed KL-divergence is conceptually similar to theexpectation–maximization algorithm. (Using the KL-divergence in the other way produces theexpectation propagation algorithm.)

Intractability

[edit]

Variational techniques are typically used to form an approximation for:

P(ZX)=P(XZ)P(Z)P(X)=P(XZ)P(Z)ZP(X,Z)dZ{\displaystyle P(\mathbf {Z} \mid \mathbf {X} )={\frac {P(\mathbf {X} \mid \mathbf {Z} )P(\mathbf {Z} )}{P(\mathbf {X} )}}={\frac {P(\mathbf {X} \mid \mathbf {Z} )P(\mathbf {Z} )}{\int _{\mathbf {Z} }P(\mathbf {X} ,\mathbf {Z} ')\,d\mathbf {Z} '}}}

The marginalization overZ{\displaystyle \mathbf {Z} } to calculateP(X){\displaystyle P(\mathbf {X} )} in the denominator is typically intractable, because, for example, the search space ofZ{\displaystyle \mathbf {Z} } is combinatorially large. Therefore, we seek an approximation, usingQ(Z)P(ZX){\displaystyle Q(\mathbf {Z} )\approx P(\mathbf {Z} \mid \mathbf {X} )}.

Evidence lower bound

[edit]
Main article:Evidence lower bound

Given thatP(ZX)=P(X,Z)P(X){\displaystyle P(\mathbf {Z} \mid \mathbf {X} )={\frac {P(\mathbf {X} ,\mathbf {Z} )}{P(\mathbf {X} )}}}, the KL-divergence above can also be written as

DKL(QP)=ZQ(Z)[logQ(Z)P(Z,X)+logP(X)]=ZQ(Z)[logQ(Z)logP(Z,X)]+ZQ(Z)[logP(X)]{\displaystyle {\begin{array}{rl}D_{\mathrm {KL} }(Q\parallel P)&=\sum _{\mathbf {Z} }Q(\mathbf {Z} )\left[\log {\frac {Q(\mathbf {Z} )}{P(\mathbf {Z} ,\mathbf {X} )}}+\log P(\mathbf {X} )\right]\\&=\sum _{\mathbf {Z} }Q(\mathbf {Z} )\left[\log Q(\mathbf {Z} )-\log P(\mathbf {Z} ,\mathbf {X} )\right]+\sum _{\mathbf {Z} }Q(\mathbf {Z} )\left[\log P(\mathbf {X} )\right]\end{array}}}

BecauseP(X){\displaystyle P(\mathbf {X} )} is a constant with respect toZ{\displaystyle \mathbf {Z} } andZQ(Z)=1{\displaystyle \sum _{\mathbf {Z} }Q(\mathbf {Z} )=1} becauseQ(Z){\displaystyle Q(\mathbf {Z} )} is a distribution, we have

DKL(QP)=ZQ(Z)[logQ(Z)logP(Z,X)]+logP(X){\displaystyle D_{\mathrm {KL} }(Q\parallel P)=\sum _{\mathbf {Z} }Q(\mathbf {Z} )\left[\log Q(\mathbf {Z} )-\log P(\mathbf {Z} ,\mathbf {X} )\right]+\log P(\mathbf {X} )}

which, according to the definition ofexpected value (for a discreterandom variable), can be written as follows

DKL(QP)=EQ[logQ(Z)logP(Z,X)]+logP(X){\displaystyle D_{\mathrm {KL} }(Q\parallel P)=\mathbb {E} _{\mathbf {Q} }\left[\log Q(\mathbf {Z} )-\log P(\mathbf {Z} ,\mathbf {X} )\right]+\log P(\mathbf {X} )}

which can be rearranged to become

logP(X)=DKL(QP)EQ[logQ(Z)logP(Z,X)]=DKL(QP)+L(Q){\displaystyle {\begin{array}{rl}\log P(\mathbf {X} )&=D_{\mathrm {KL} }(Q\parallel P)-\mathbb {E} _{\mathbf {Q} }\left[\log Q(\mathbf {Z} )-\log P(\mathbf {Z} ,\mathbf {X} )\right]\\&=D_{\mathrm {KL} }(Q\parallel P)+{\mathcal {L}}(Q)\end{array}}}

As thelog-evidencelogP(X){\displaystyle \log P(\mathbf {X} )} is fixed with respect toQ{\displaystyle Q}, maximizing the final termL(Q){\displaystyle {\mathcal {L}}(Q)} minimizes the KL divergence ofQ{\displaystyle Q} fromP{\displaystyle P}. By appropriate choice ofQ{\displaystyle Q},L(Q){\displaystyle {\mathcal {L}}(Q)} becomes tractable to compute and to maximize. Hence we have both an analytical approximationQ{\displaystyle Q} for the posteriorP(ZX){\displaystyle P(\mathbf {Z} \mid \mathbf {X} )}, and a lower boundL(Q){\displaystyle {\mathcal {L}}(Q)} for the log-evidencelogP(X){\displaystyle \log P(\mathbf {X} )} (since the KL-divergence is non-negative).

The lower boundL(Q){\displaystyle {\mathcal {L}}(Q)} is known as the (negative)variational free energy in analogy withthermodynamic free energy because it can also be expressed as a negative energyEQ[logP(Z,X)]{\displaystyle \operatorname {E} _{Q}[\log P(\mathbf {Z} ,\mathbf {X} )]} plus theentropy ofQ{\displaystyle Q}. The termL(Q){\displaystyle {\mathcal {L}}(Q)} is also known asEvidence Lower Bound, abbreviated asELBO, to emphasize that it is a lower (worst-case) bound on the log-evidence of the data.

Proofs

[edit]

By the generalizedPythagorean theorem ofBregman divergence, of which KL-divergence is a special case, it can be shown that:[1][2]

Generalized Pythagorean theorem forBregman divergence[2]
DKL(QP)DKL(QQ)+DKL(QP),QC{\displaystyle D_{\mathrm {KL} }(Q\parallel P)\geq D_{\mathrm {KL} }(Q\parallel Q^{*})+D_{\mathrm {KL} }(Q^{*}\parallel P),\forall Q^{*}\in {\mathcal {C}}}

whereC{\displaystyle {\mathcal {C}}} is aconvex set and the equality holds if:

Q=QargminQCDKL(QP).{\displaystyle Q=Q^{*}\triangleq \arg \min _{Q\in {\mathcal {C}}}D_{\mathrm {KL} }(Q\parallel P).}

In this case, the global minimizerQ(Z)=q(Z1Z2)q(Z2)=q(Z2Z1)q(Z1),{\displaystyle Q^{*}(\mathbf {Z} )=q^{*}(\mathbf {Z} _{1}\mid \mathbf {Z} _{2})q^{*}(\mathbf {Z} _{2})=q^{*}(\mathbf {Z} _{2}\mid \mathbf {Z} _{1})q^{*}(\mathbf {Z} _{1}),} withZ={Z1,Z2},{\displaystyle \mathbf {Z} =\{\mathbf {Z_{1}} ,\mathbf {Z_{2}} \},} can be found as follows:[1]

q(Z2)=P(X)ζ(X)P(Z2X)exp(DKL(q(Z1Z2)P(Z1Z2,X)))=1ζ(X)expEq(Z1Z2)(logP(Z,X)q(Z1Z2)),{\displaystyle {\begin{array}{rl}q^{*}(\mathbf {Z} _{2})&={\frac {P(\mathbf {X} )}{\zeta (\mathbf {X} )}}{\frac {P(\mathbf {Z} _{2}\mid \mathbf {X} )}{\exp(D_{\mathrm {KL} }(q^{*}(\mathbf {Z} _{1}\mid \mathbf {Z} _{2})\parallel P(\mathbf {Z} _{1}\mid \mathbf {Z} _{2},\mathbf {X} )))}}\\&={\frac {1}{\zeta (\mathbf {X} )}}\exp \mathbb {E} _{q^{*}(\mathbf {Z} _{1}\mid \mathbf {Z} _{2})}\left(\log {\frac {P(\mathbf {Z} ,\mathbf {X} )}{q^{*}(\mathbf {Z} _{1}\mid \mathbf {Z} _{2})}}\right),\end{array}}}

in which the normalizing constant is:

ζ(X)=P(X)Z2P(Z2X)exp(DKL(q(Z1Z2)P(Z1Z2,X)))=Z2expEq(Z1Z2)(logP(Z,X)q(Z1Z2)).{\displaystyle {\begin{array}{rl}\zeta (\mathbf {X} )&=P(\mathbf {X} )\int _{\mathbf {Z} _{2}}{\frac {P(\mathbf {Z} _{2}\mid \mathbf {X} )}{\exp(D_{\mathrm {KL} }(q^{*}(\mathbf {Z} _{1}\mid \mathbf {Z} _{2})\parallel P(\mathbf {Z} _{1}\mid \mathbf {Z} _{2},\mathbf {X} )))}}\\&=\int _{\mathbf {Z} _{2}}\exp \mathbb {E} _{q^{*}(\mathbf {Z} _{1}\mid \mathbf {Z} _{2})}\left(\log {\frac {P(\mathbf {Z} ,\mathbf {X} )}{q^{*}(\mathbf {Z} _{1}\mid \mathbf {Z} _{2})}}\right).\end{array}}}

The termζ(X){\displaystyle \zeta (\mathbf {X} )} is often called theevidence lower bound (ELBO) in practice, sinceP(X)ζ(X)=exp(L(Q)){\displaystyle P(\mathbf {X} )\geq \zeta (\mathbf {X} )=\exp({\mathcal {L}}(Q^{*}))},[1] as shown above.

By interchanging the roles ofZ1{\displaystyle \mathbf {Z} _{1}} andZ2,{\displaystyle \mathbf {Z} _{2},} we can iteratively compute the approximatedq(Z1){\displaystyle q^{*}(\mathbf {Z} _{1})} andq(Z2){\displaystyle q^{*}(\mathbf {Z} _{2})} of the true model's marginalsP(Z1X){\displaystyle P(\mathbf {Z} _{1}\mid \mathbf {X} )} andP(Z2X),{\displaystyle P(\mathbf {Z} _{2}\mid \mathbf {X} ),} respectively. Although this iterative scheme is guaranteed to converge monotonically,[1] the convergedQ{\displaystyle Q^{*}} is only a local minimizer ofDKL(QP){\displaystyle D_{\mathrm {KL} }(Q\parallel P)}.

If the constrained spaceC{\displaystyle {\mathcal {C}}} is confined within independent space, i.e.q(Z1Z2)=q(Z1),{\displaystyle q^{*}(\mathbf {Z} _{1}\mid \mathbf {Z} _{2})=q^{*}(\mathbf {Z_{1}} ),}the above iterative scheme will become the so-called mean field approximationQ(Z)=q(Z1)q(Z2),{\displaystyle Q^{*}(\mathbf {Z} )=q^{*}(\mathbf {Z} _{1})q^{*}(\mathbf {Z} _{2}),}as shown below.

Mean field approximation

[edit]

The variational distributionQ(Z){\displaystyle Q(\mathbf {Z} )} is usually assumed to factorize over somepartition of the latent variables, i.e. for some partition of the latent variablesZ{\displaystyle \mathbf {Z} } intoZ1ZM{\displaystyle \mathbf {Z} _{1}\dots \mathbf {Z} _{M}},

Q(Z)=i=1Mqi(ZiX){\displaystyle Q(\mathbf {Z} )=\prod _{i=1}^{M}q_{i}(\mathbf {Z} _{i}\mid \mathbf {X} )}

It can be shown using thecalculus of variations (hence the name "variational Bayes") that the "best" distributionqj{\displaystyle q_{j}^{*}} for each of the factorsqj{\displaystyle q_{j}} (in terms of the distribution minimizing the KL divergence, as described above) satisfies:[3]

qj(ZjX)=eEqj[lnp(Z,X)]eEqj[lnp(Z,X)]dZj{\displaystyle q_{j}^{*}(\mathbf {Z} _{j}\mid \mathbf {X} )={\frac {e^{\operatorname {E} _{q_{-j}^{*}}[\ln p(\mathbf {Z} ,\mathbf {X} )]}}{\int e^{\operatorname {E} _{q_{-j}^{*}}[\ln p(\mathbf {Z} ,\mathbf {X} )]}\,d\mathbf {Z} _{j}}}}

whereEqj[lnp(Z,X)]{\displaystyle \operatorname {E} _{q_{-j}^{*}}[\ln p(\mathbf {Z} ,\mathbf {X} )]} is theexpectation of the logarithm of thejoint probability of the data and latent variables, taken with respect toq{\displaystyle q^{*}} over all variables not in the partition: refer to Lemma 4.1 of[4] for a derivation of the distributionqj(ZjX){\displaystyle q_{j}^{*}(\mathbf {Z} _{j}\mid \mathbf {X} )}.

In practice, we usually work in terms of logarithms, i.e.:

lnqj(ZjX)=Eqj[lnp(Z,X)]+constant{\displaystyle \ln q_{j}^{*}(\mathbf {Z} _{j}\mid \mathbf {X} )=\operatorname {E} _{q_{-j}^{*}}[\ln p(\mathbf {Z} ,\mathbf {X} )]+{\text{constant}}}

The constant in the above expression is related to thenormalizing constant (the denominator in the expression above forqj{\displaystyle q_{j}^{*}}) and is usually reinstated by inspection, as the rest of the expression can usually be recognized as being a known type of distribution (e.g.Gaussian,gamma, etc.).

Using the properties of expectations, the expressionEqj[lnp(Z,X)]{\displaystyle \operatorname {E} _{q_{-j}^{*}}[\ln p(\mathbf {Z} ,\mathbf {X} )]} can usually be simplified into a function of the fixedhyperparameters of theprior distributions over the latent variables and of expectations (and sometimes highermoments such as thevariance) of latent variables not in the current partition (i.e. latent variables not included inZj{\displaystyle \mathbf {Z} _{j}}). This createscircular dependencies between the parameters of the distributions over variables in one partition and the expectations of variables in the other partitions. This naturally suggests aniterative algorithm, much like EM (theexpectation–maximization algorithm), in which the expectations (and possibly higher moments) of the latent variables are initialized in some fashion (perhaps randomly), and then the parameters of each distribution are computed in turn using the current values of the expectations, after which the expectation of the newly computed distribution is set appropriately according to the computed parameters. An algorithm of this sort is guaranteed toconverge.[5]

In other words, for each of the partitions of variables, by simplifying the expression for the distribution over the partition's variables and examining the distribution'sfunctional dependency on the variables in question, the family of the distribution can usually be determined (which in turn determines the value of the constant). The formula for the distribution's parameters will be expressed in terms of the prior distributions' hyperparameters (which are known constants), but also in terms of expectations of functions of variables in other partitions. Usually these expectations can be simplified into functions of expectations of the variables themselves (i.e. themeans); sometimes expectations of squared variables (which can be related to thevariance of the variables), or expectations of higher powers (i.e. highermoments) also appear. In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up. However, those formulas depend on those distributions' parameters, which depend in turn on the expectations about other variables. The result is that the formulas for the parameters of each variable's distributions can be expressed as a series of equations with mutual,nonlinear dependencies among the variables. Usually, it is not possible to solve this system of equations directly. However, as described above, the dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge. An example will make this process clearer.

A duality formula for variational inference

[edit]
Pictorial illustration of coordinate ascent variational inference algorithm by the duality formula[4]

The following theorem is referred to as a duality formula for variational inference.[4] It explains some important properties of the variational distributions used in variational Bayes methods.

Theorem Consider twoprobability spaces(Θ,F,P){\displaystyle (\Theta ,{\mathcal {F}},P)} and(Θ,F,Q){\displaystyle (\Theta ,{\mathcal {F}},Q)} withQP{\displaystyle Q\ll P}. Assume that there is a common dominatingprobability measureλ{\displaystyle \lambda } such thatPλ{\displaystyle P\ll \lambda } andQλ{\displaystyle Q\ll \lambda }. Leth{\displaystyle h} denote any real-valuedrandom variable on(Θ,F,P){\displaystyle (\Theta ,{\mathcal {F}},P)} that satisfieshL1(P){\displaystyle h\in L_{1}(P)}. Then the following equality holds

logEP[exph]=supQP{EQ[h]DKL(QP)}.{\displaystyle \log E_{P}[\exp h]={\text{sup}}_{Q\ll P}\{E_{Q}[h]-D_{\text{KL}}(Q\parallel P)\}.}

Further, the supremum on the right-hand side is attainedif and only if it holds

q(θ)p(θ)=exph(θ)EP[exph],{\displaystyle {\frac {q(\theta )}{p(\theta )}}={\frac {\exp h(\theta )}{E_{P}[\exp h]}},}

almost surely with respect to probability measureQ{\displaystyle Q}, wherep(θ)=dP/dλ{\displaystyle p(\theta )=dP/d\lambda } andq(θ)=dQ/dλ{\displaystyle q(\theta )=dQ/d\lambda } denote the Radon–Nikodym derivatives of the probability measuresP{\displaystyle P} andQ{\displaystyle Q} with respect toλ{\displaystyle \lambda }, respectively.

A basic example

[edit]

Consider a simple non-hierarchical Bayesian model consisting of a set ofi.i.d. observations from aGaussian distribution, with unknownmean andvariance.[6] In the following, we work through this model in great detail to illustrate the workings of the variational Bayes method.

For mathematical convenience, in the following example we work in terms of theprecision — i.e. the reciprocal of the variance (or in a multivariate Gaussian, the inverse of thecovariance matrix) — rather than the variance itself. (From a theoretical standpoint, precision and variance are equivalent since there is aone-to-one correspondence between the two.)

The mathematical model

[edit]

We placeconjugate prior distributions on the unknown meanμ{\displaystyle \mu } and precisionτ{\displaystyle \tau }, i.e. the mean also follows a Gaussian distribution while the precision follows agamma distribution. In other words:

τGamma(a0,b0)μ|τN(μ0,(λ0τ)1){x1,,xN}N(μ,τ1)N=number of data points{\displaystyle {\begin{aligned}\tau &\sim \operatorname {Gamma} (a_{0},b_{0})\\\mu |\tau &\sim {\mathcal {N}}(\mu _{0},(\lambda _{0}\tau )^{-1})\\\{x_{1},\dots ,x_{N}\}&\sim {\mathcal {N}}(\mu ,\tau ^{-1})\\N&={\text{number of data points}}\end{aligned}}}

Thehyperparametersμ0,λ0,a0{\displaystyle \mu _{0},\lambda _{0},a_{0}} andb0{\displaystyle b_{0}} in the prior distributions are fixed, given values. They can be set to small positive numbers to give broad prior distributions indicating ignorance about the prior distributions ofμ{\displaystyle \mu } andτ{\displaystyle \tau }.

We are givenN{\displaystyle N} data pointsX={x1,,xN}{\displaystyle \mathbf {X} =\{x_{1},\ldots ,x_{N}\}} and our goal is to infer theposterior distributionq(μ,τ)=p(μ,τx1,,xN){\displaystyle q(\mu ,\tau )=p(\mu ,\tau \mid x_{1},\ldots ,x_{N})} of the parametersμ{\displaystyle \mu } andτ.{\displaystyle \tau .}

The joint probability

[edit]

Thejoint probability of all variables can be rewritten as

p(X,μ,τ)=p(Xμ,τ)p(μτ)p(τ){\displaystyle p(\mathbf {X} ,\mu ,\tau )=p(\mathbf {X} \mid \mu ,\tau )p(\mu \mid \tau )p(\tau )}

where the individual factors are

p(Xμ,τ)=n=1NN(xnμ,τ1)p(μτ)=N(μμ0,(λ0τ)1)p(τ)=Gamma(τa0,b0){\displaystyle {\begin{aligned}p(\mathbf {X} \mid \mu ,\tau )&=\prod _{n=1}^{N}{\mathcal {N}}(x_{n}\mid \mu ,\tau ^{-1})\\p(\mu \mid \tau )&={\mathcal {N}}\left(\mu \mid \mu _{0},(\lambda _{0}\tau )^{-1}\right)\\p(\tau )&=\operatorname {Gamma} (\tau \mid a_{0},b_{0})\end{aligned}}}

where

N(xμ,σ2)=12πσ2e(xμ)22σ2Gamma(τa,b)=1Γ(a)baτa1ebτ{\displaystyle {\begin{aligned}{\mathcal {N}}(x\mid \mu ,\sigma ^{2})&={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{\frac {-(x-\mu )^{2}}{2\sigma ^{2}}}\\\operatorname {Gamma} (\tau \mid a,b)&={\frac {1}{\Gamma (a)}}b^{a}\tau ^{a-1}e^{-b\tau }\end{aligned}}}

Factorized approximation

[edit]

Assume thatq(μ,τ)=q(μ)q(τ){\displaystyle q(\mu ,\tau )=q(\mu )q(\tau )}, i.e. that the posterior distribution factorizes into independent factors forμ{\displaystyle \mu } andτ{\displaystyle \tau }. This type of assumption underlies the variational Bayesian method. The true posterior distribution does not in fact factor this way (in fact, in this simple case, it is known to be aGaussian-gamma distribution), and hence the result we obtain will be an approximation.

Derivation ofq(μ)

[edit]

Then

lnqμ(μ)=Eτ[lnp(Xμ,τ)+lnp(μτ)+lnp(τ)]+C=Eτ[lnp(Xμ,τ)]+Eτ[lnp(μτ)]+Eτ[lnp(τ)]+C=Eτ[lnn=1NN(xnμ,τ1)]+Eτ[lnN(μμ0,(λ0τ)1)]+C2=Eτ[lnn=1Nτ2πe(xnμ)2τ2]+Eτ[lnλ0τ2πe(μμ0)2λ0τ2]+C2=Eτ[n=1N(12(lnτln2π)(xnμ)2τ2)]+Eτ[12(lnλ0+lnτln2π)(μμ0)2λ0τ2]+C2=Eτ[n=1N(xnμ)2τ2]+Eτ[(μμ0)2λ0τ2]+Eτ[n=1N12(lnτln2π)]+Eτ[12(lnλ0+lnτln2π)]+C2=Eτ[n=1N(xnμ)2τ2]+Eτ[(μμ0)2λ0τ2]+C3=Eτ[τ]2{n=1N(xnμ)2+λ0(μμ0)2}+C3{\displaystyle {\begin{aligned}\ln q_{\mu }^{*}(\mu )&=\operatorname {E} _{\tau }\left[\ln p(\mathbf {X} \mid \mu ,\tau )+\ln p(\mu \mid \tau )+\ln p(\tau )\right]+C\\&=\operatorname {E} _{\tau }\left[\ln p(\mathbf {X} \mid \mu ,\tau )\right]+\operatorname {E} _{\tau }\left[\ln p(\mu \mid \tau )\right]+\operatorname {E} _{\tau }\left[\ln p(\tau )\right]+C\\&=\operatorname {E} _{\tau }\left[\ln \prod _{n=1}^{N}{\mathcal {N}}\left(x_{n}\mid \mu ,\tau ^{-1}\right)\right]+\operatorname {E} _{\tau }\left[\ln {\mathcal {N}}\left(\mu \mid \mu _{0},(\lambda _{0}\tau )^{-1}\right)\right]+C_{2}\\&=\operatorname {E} _{\tau }\left[\ln \prod _{n=1}^{N}{\sqrt {\frac {\tau }{2\pi }}}e^{-{\frac {(x_{n}-\mu )^{2}\tau }{2}}}\right]+\operatorname {E} _{\tau }\left[\ln {\sqrt {\frac {\lambda _{0}\tau }{2\pi }}}e^{-{\frac {(\mu -\mu _{0})^{2}\lambda _{0}\tau }{2}}}\right]+C_{2}\\&=\operatorname {E} _{\tau }\left[\sum _{n=1}^{N}\left({\frac {1}{2}}(\ln \tau -\ln 2\pi )-{\frac {(x_{n}-\mu )^{2}\tau }{2}}\right)\right]+\operatorname {E} _{\tau }\left[{\frac {1}{2}}(\ln \lambda _{0}+\ln \tau -\ln 2\pi )-{\frac {(\mu -\mu _{0})^{2}\lambda _{0}\tau }{2}}\right]+C_{2}\\&=\operatorname {E} _{\tau }\left[\sum _{n=1}^{N}-{\frac {(x_{n}-\mu )^{2}\tau }{2}}\right]+\operatorname {E} _{\tau }\left[-{\frac {(\mu -\mu _{0})^{2}\lambda _{0}\tau }{2}}\right]+\operatorname {E} _{\tau }\left[\sum _{n=1}^{N}{\frac {1}{2}}(\ln \tau -\ln 2\pi )\right]+\operatorname {E} _{\tau }\left[{\frac {1}{2}}(\ln \lambda _{0}+\ln \tau -\ln 2\pi )\right]+C_{2}\\&=\operatorname {E} _{\tau }\left[\sum _{n=1}^{N}-{\frac {(x_{n}-\mu )^{2}\tau }{2}}\right]+\operatorname {E} _{\tau }\left[-{\frac {(\mu -\mu _{0})^{2}\lambda _{0}\tau }{2}}\right]+C_{3}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\right\}+C_{3}\end{aligned}}}

In the above derivation,C{\displaystyle C},C2{\displaystyle C_{2}} andC3{\displaystyle C_{3}} refer to values that are constant with respect toμ{\displaystyle \mu }. Note that the termEτ[lnp(τ)]{\displaystyle \operatorname {E} _{\tau }[\ln p(\tau )]} is not a function ofμ{\displaystyle \mu } and will have the same value regardless of the value ofμ{\displaystyle \mu }. Hence in line 3 we can absorb it into theconstant term at the end. We do the same thing in line 7.

The last line is simply a quadratic polynomial inμ{\displaystyle \mu }. Since this is the logarithm ofqμ(μ){\displaystyle q_{\mu }^{*}(\mu )}, we can see thatqμ(μ){\displaystyle q_{\mu }^{*}(\mu )} itself is aGaussian distribution.

With a certain amount of tedious math (expanding the squares inside of the braces, separating out and grouping the terms involvingμ{\displaystyle \mu } andμ2{\displaystyle \mu ^{2}} andcompleting the square overμ{\displaystyle \mu }), we can derive the parameters of the Gaussian distribution:

lnqμ(μ)=Eτ[τ]2{n=1N(xnμ)2+λ0(μμ0)2}+C3=Eτ[τ]2{n=1N(xn22xnμ+μ2)+λ0(μ22μ0μ+μ02)}+C3=Eτ[τ]2{(n=1Nxn2)2(n=1Nxn)μ+(n=1Nμ2)+λ0μ22λ0μ0μ+λ0μ02}+C3=Eτ[τ]2{(λ0+N)μ22(λ0μ0+n=1Nxn)μ+(n=1Nxn2)+λ0μ02}+C3=Eτ[τ]2{(λ0+N)μ22(λ0μ0+n=1Nxn)μ}+C4=Eτ[τ]2{(λ0+N)μ22(λ0μ0+n=1Nxnλ0+N)(λ0+N)μ}+C4=Eτ[τ]2{(λ0+N)(μ22(λ0μ0+n=1Nxnλ0+N)μ)}+C4=Eτ[τ]2{(λ0+N)(μ22(λ0μ0+n=1Nxnλ0+N)μ+(λ0μ0+n=1Nxnλ0+N)2(λ0μ0+n=1Nxnλ0+N)2)}+C4=Eτ[τ]2{(λ0+N)(μ22(λ0μ0+n=1Nxnλ0+N)μ+(λ0μ0+n=1Nxnλ0+N)2)}+C5=Eτ[τ]2{(λ0+N)(μλ0μ0+n=1Nxnλ0+N)2}+C5=12(λ0+N)Eτ[τ](μλ0μ0+n=1Nxnλ0+N)2+C5{\displaystyle {\begin{aligned}\ln q_{\mu }^{*}(\mu )&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\right\}+C_{3}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{\sum _{n=1}^{N}(x_{n}^{2}-2x_{n}\mu +\mu ^{2})+\lambda _{0}(\mu ^{2}-2\mu _{0}\mu +\mu _{0}^{2})\right\}+C_{3}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{\left(\sum _{n=1}^{N}x_{n}^{2}\right)-2\left(\sum _{n=1}^{N}x_{n}\right)\mu +\left(\sum _{n=1}^{N}\mu ^{2}\right)+\lambda _{0}\mu ^{2}-2\lambda _{0}\mu _{0}\mu +\lambda _{0}\mu _{0}^{2}\right\}+C_{3}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\mu ^{2}-2\left(\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}\right)\mu +\left(\sum _{n=1}^{N}x_{n}^{2}\right)+\lambda _{0}\mu _{0}^{2}\right\}+C_{3}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\mu ^{2}-2\left(\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}\right)\mu \right\}+C_{4}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\mu ^{2}-2\left({\frac {\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)(\lambda _{0}+N)\mu \right\}+C_{4}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\left(\mu ^{2}-2\left({\frac {\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)\mu \right)\right\}+C_{4}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\left(\mu ^{2}-2\left({\frac {\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)\mu +\left({\frac {\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)^{2}-\left({\frac {\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)^{2}\right)\right\}+C_{4}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\left(\mu ^{2}-2\left({\frac {\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)\mu +\left({\frac {\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)^{2}\right)\right\}+C_{5}\\&=-{\frac {\operatorname {E} _{\tau }[\tau ]}{2}}\left\{(\lambda _{0}+N)\left(\mu -{\frac {\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)^{2}\right\}+C_{5}\\&=-{\frac {1}{2}}(\lambda _{0}+N)\operatorname {E} _{\tau }[\tau ]\left(\mu -{\frac {\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}}{\lambda _{0}+N}}\right)^{2}+C_{5}\end{aligned}}}

Note that all of the above steps can be shortened by using the formula for thesum of two quadratics.

In other words:

qμ(μ)N(μμN,λN1)μN=λ0μ0+Nx¯λ0+NλN=(λ0+N)Eτ[τ]x¯=1Nn=1Nxn{\displaystyle {\begin{aligned}q_{\mu }^{*}(\mu )&\sim {\mathcal {N}}(\mu \mid \mu _{N},\lambda _{N}^{-1})\\\mu _{N}&={\frac {\lambda _{0}\mu _{0}+N{\bar {x}}}{\lambda _{0}+N}}\\\lambda _{N}&=(\lambda _{0}+N)\operatorname {E} _{\tau }[\tau ]\\{\bar {x}}&={\frac {1}{N}}\sum _{n=1}^{N}x_{n}\end{aligned}}}

Derivation ofq(τ)

[edit]

The derivation ofqτ(τ){\displaystyle q_{\tau }^{*}(\tau )} is similar to above, although we omit some of the details for the sake of brevity.

lnqτ(τ)=Eμ[lnp(Xμ,τ)+lnp(μτ)]+lnp(τ)+constant=(a01)lnτb0τ+12lnτ+N2lnττ2Eμ[n=1N(xnμ)2+λ0(μμ0)2]+constant{\displaystyle {\begin{aligned}\ln q_{\tau }^{*}(\tau )&=\operatorname {E} _{\mu }[\ln p(\mathbf {X} \mid \mu ,\tau )+\ln p(\mu \mid \tau )]+\ln p(\tau )+{\text{constant}}\\&=(a_{0}-1)\ln \tau -b_{0}\tau +{\frac {1}{2}}\ln \tau +{\frac {N}{2}}\ln \tau -{\frac {\tau }{2}}\operatorname {E} _{\mu }\left[\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\right]+{\text{constant}}\end{aligned}}}

Exponentiating both sides, we can see thatqτ(τ){\displaystyle q_{\tau }^{*}(\tau )} is agamma distribution. Specifically:

qτ(τ)Gamma(τaN,bN)aN=a0+N+12bN=b0+12Eμ[n=1N(xnμ)2+λ0(μμ0)2]{\displaystyle {\begin{aligned}q_{\tau }^{*}(\tau )&\sim \operatorname {Gamma} (\tau \mid a_{N},b_{N})\\a_{N}&=a_{0}+{\frac {N+1}{2}}\\b_{N}&=b_{0}+{\frac {1}{2}}\operatorname {E} _{\mu }\left[\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\right]\end{aligned}}}

Algorithm for computing the parameters

[edit]

Let us recap the conclusions from the previous sections:

qμ(μ)N(μμN,λN1)μN=λ0μ0+Nx¯λ0+NλN=(λ0+N)Eτ[τ]x¯=1Nn=1Nxn{\displaystyle {\begin{aligned}q_{\mu }^{*}(\mu )&\sim {\mathcal {N}}(\mu \mid \mu _{N},\lambda _{N}^{-1})\\\mu _{N}&={\frac {\lambda _{0}\mu _{0}+N{\bar {x}}}{\lambda _{0}+N}}\\\lambda _{N}&=(\lambda _{0}+N)\operatorname {E} _{\tau }[\tau ]\\{\bar {x}}&={\frac {1}{N}}\sum _{n=1}^{N}x_{n}\end{aligned}}}

and

qτ(τ)Gamma(τaN,bN)aN=a0+N+12bN=b0+12Eμ[n=1N(xnμ)2+λ0(μμ0)2]{\displaystyle {\begin{aligned}q_{\tau }^{*}(\tau )&\sim \operatorname {Gamma} (\tau \mid a_{N},b_{N})\\a_{N}&=a_{0}+{\frac {N+1}{2}}\\b_{N}&=b_{0}+{\frac {1}{2}}\operatorname {E} _{\mu }\left[\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\right]\end{aligned}}}

In each case, the parameters for the distribution over one of the variables depend on expectations taken with respect to the other variable. We can expand the expectations, using the standard formulas for the expectations of moments of the Gaussian and gamma distributions:

E[τaN,bN]=aNbNE[μμN,λN1]=μNE[X2]=Var(X)+(E[X])2E[μ2μN,λN1]=λN1+μN2{\displaystyle {\begin{aligned}\operatorname {E} [\tau \mid a_{N},b_{N}]&={\frac {a_{N}}{b_{N}}}\\\operatorname {E} \left[\mu \mid \mu _{N},\lambda _{N}^{-1}\right]&=\mu _{N}\\\operatorname {E} \left[X^{2}\right]&=\operatorname {Var} (X)+(\operatorname {E} [X])^{2}\\\operatorname {E} \left[\mu ^{2}\mid \mu _{N},\lambda _{N}^{-1}\right]&=\lambda _{N}^{-1}+\mu _{N}^{2}\end{aligned}}}

Applying these formulas to the above equations is trivial in most cases, but the equation forbN{\displaystyle b_{N}} takes more work:

bN=b0+12Eμ[n=1N(xnμ)2+λ0(μμ0)2]=b0+12Eμ[(λ0+N)μ22(λ0μ0+n=1Nxn)μ+(n=1Nxn2)+λ0μ02]=b0+12[(λ0+N)Eμ[μ2]2(λ0μ0+n=1Nxn)Eμ[μ]+(n=1Nxn2)+λ0μ02]=b0+12[(λ0+N)(λN1+μN2)2(λ0μ0+n=1Nxn)μN+(n=1Nxn2)+λ0μ02]{\displaystyle {\begin{aligned}b_{N}&=b_{0}+{\frac {1}{2}}\operatorname {E} _{\mu }\left[\sum _{n=1}^{N}(x_{n}-\mu )^{2}+\lambda _{0}(\mu -\mu _{0})^{2}\right]\\&=b_{0}+{\frac {1}{2}}\operatorname {E} _{\mu }\left[(\lambda _{0}+N)\mu ^{2}-2\left(\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}\right)\mu +\left(\sum _{n=1}^{N}x_{n}^{2}\right)+\lambda _{0}\mu _{0}^{2}\right]\\&=b_{0}+{\frac {1}{2}}\left[(\lambda _{0}+N)\operatorname {E} _{\mu }[\mu ^{2}]-2\left(\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}\right)\operatorname {E} _{\mu }[\mu ]+\left(\sum _{n=1}^{N}x_{n}^{2}\right)+\lambda _{0}\mu _{0}^{2}\right]\\&=b_{0}+{\frac {1}{2}}\left[(\lambda _{0}+N)\left(\lambda _{N}^{-1}+\mu _{N}^{2}\right)-2\left(\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}\right)\mu _{N}+\left(\sum _{n=1}^{N}x_{n}^{2}\right)+\lambda _{0}\mu _{0}^{2}\right]\\\end{aligned}}}

We can then write the parameter equations as follows, without any expectations:

μN=λ0μ0+Nx¯λ0+NλN=(λ0+N)aNbNx¯=1Nn=1NxnaN=a0+N+12bN=b0+12[(λ0+N)(λN1+μN2)2(λ0μ0+n=1Nxn)μN+(n=1Nxn2)+λ0μ02]{\displaystyle {\begin{aligned}\mu _{N}&={\frac {\lambda _{0}\mu _{0}+N{\bar {x}}}{\lambda _{0}+N}}\\\lambda _{N}&=(\lambda _{0}+N){\frac {a_{N}}{b_{N}}}\\{\bar {x}}&={\frac {1}{N}}\sum _{n=1}^{N}x_{n}\\a_{N}&=a_{0}+{\frac {N+1}{2}}\\b_{N}&=b_{0}+{\frac {1}{2}}\left[(\lambda _{0}+N)\left(\lambda _{N}^{-1}+\mu _{N}^{2}\right)-2\left(\lambda _{0}\mu _{0}+\sum _{n=1}^{N}x_{n}\right)\mu _{N}+\left(\sum _{n=1}^{N}x_{n}^{2}\right)+\lambda _{0}\mu _{0}^{2}\right]\end{aligned}}}

Note that there are circular dependencies among the formulas forλN{\displaystyle \lambda _{N}}andbN{\displaystyle b_{N}}. This naturally suggests anEM-like algorithm:

  1. Computen=1Nxn{\displaystyle \sum _{n=1}^{N}x_{n}} andn=1Nxn2.{\displaystyle \sum _{n=1}^{N}x_{n}^{2}.} Use these values to computeμN{\displaystyle \mu _{N}} andaN.{\displaystyle a_{N}.}
  2. InitializeλN{\displaystyle \lambda _{N}} to some arbitrary value.
  3. Use the current value ofλN,{\displaystyle \lambda _{N},} along with the known values of the other parameters, to computebN{\displaystyle b_{N}}.
  4. Use the current value ofbN,{\displaystyle b_{N},} along with the known values of the other parameters, to computeλN{\displaystyle \lambda _{N}}.
  5. Repeat the last two steps until convergence (i.e. until neither value has changed more than some small amount).

We then have values for the hyperparameters of the approximating distributions of the posterior parameters, which we can use to compute any properties we want of the posterior — e.g. its mean and variance, a 95% highest-density region (the smallest interval that includes 95% of the total probability), etc.

It can be shown that this algorithm is guaranteed to converge to a local maximum.

Note also that the posterior distributions have the same form as the corresponding prior distributions. We didnot assume this; the only assumption we made was that the distributions factorize, and the form of the distributions followed naturally. It turns out (see below) that the fact that the posterior distributions have the same form as the prior distributions is not a coincidence, but a general result whenever the prior distributions are members of theexponential family, which is the case for most of the standard distributions.

Further discussion

[edit]

Step-by-step recipe

[edit]

The above example shows the method by which the variational-Bayesian approximation to aposterior probability density in a givenBayesian network is derived:

  1. Describe the network with agraphical model, identifying the observed variables (data)X{\displaystyle \mathbf {X} } and unobserved variables (parametersΘ{\displaystyle {\boldsymbol {\Theta }}} andlatent variablesZ{\displaystyle \mathbf {Z} }) and theirconditional probability distributions. Variational Bayes will then construct an approximation to the posterior probabilityp(Z,ΘX){\displaystyle p(\mathbf {Z} ,{\boldsymbol {\Theta }}\mid \mathbf {X} )}. The approximation has the basic property that it is a factorized distribution, i.e. a product of two or moreindependent distributions over disjoint subsets of the unobserved variables.
  2. Partition the unobserved variables into two or more subsets, over which the independent factors will be derived. There is no universal procedure for doing this; creating too many subsets yields a poor approximation, while creating too few makes the entire variational Bayes procedure intractable. Typically, the first split is to separate the parameters and latent variables; often, this is enough by itself to produce a tractable result. Assume that the partitions are calledZ1,,ZM{\displaystyle \mathbf {Z} _{1},\ldots ,\mathbf {Z} _{M}}.
  3. For a given partitionZj{\displaystyle \mathbf {Z} _{j}}, write down the formula for the best approximating distributionqj(ZjX){\displaystyle q_{j}^{*}(\mathbf {Z} _{j}\mid \mathbf {X} )} using the basic equationlnqj(ZjX)=Eij[lnp(Z,X)]+constant{\displaystyle \ln q_{j}^{*}(\mathbf {Z} _{j}\mid \mathbf {X} )=\operatorname {E} _{i\neq j}[\ln p(\mathbf {Z} ,\mathbf {X} )]+{\text{constant}}} .
  4. Fill in the formula for thejoint probability distribution using the graphical model. Any component conditional distributions that don't involve any of the variables inZj{\displaystyle \mathbf {Z} _{j}} can be ignored; they will be folded into the constant term.
  5. Simplify the formula and apply the expectation operator, following the above example. Ideally, this should simplify into expectations of basic functions of variables not inZj{\displaystyle \mathbf {Z} _{j}} (e.g. first or second rawmoments, expectation of a logarithm, etc.). In order for the variational Bayes procedure to work well, these expectations should generally be expressible analytically as functions of the parameters and/orhyperparameters of the distributions of these variables. In all cases, these expectation terms are constants with respect to the variables in the current partition.
  6. The functional form of the formula with respect to the variables in the current partition indicates the type of distribution. In particular, exponentiating the formula generates theprobability density function (PDF) of the distribution (or at least, something proportional to it, with unknownnormalization constant). In order for the overall method to be tractable, it should be possible to recognize the functional form as belonging to a known distribution. Significant mathematical manipulation may be required to convert the formula into a form that matches the PDF of a known distribution. When this can be done, the normalization constant can be reinstated by definition, and equations for the parameters of the known distribution can be derived by extracting the appropriate parts of the formula.
  7. When all expectations can be replaced analytically with functions of variables not in the current partition, and the PDF put into a form that allows identification with a known distribution, the result is a set of equations expressing the values of the optimum parameters as functions of the parameters of variables in other partitions.
  8. When this procedure can be applied to all partitions, the result is a set of mutually linked equations specifying the optimum values of all parameters.
  9. Anexpectation–maximization (EM) type procedure is then applied, picking an initial value for each parameter and the iterating through a series of steps, where at each step we cycle through the equations, updating each parameter in turn. This is guaranteed to converge.

Most important points

[edit]

Due to all of the mathematical manipulations involved, it is easy to lose track of the big picture. The important things are:

  1. The idea of variational Bayes is to construct an analytical approximation to theposterior probability of the set of unobserved variables (parameters and latent variables), given the data. This means that the form of the solution is similar to otherBayesian inference methods, such asGibbs sampling — i.e. a distribution that seeks to describe everything that is known about the variables. As in other Bayesian methods — but unlike e.g. inexpectation–maximization (EM) or othermaximum likelihood methods — both types of unobserved variables (i.e. parameters and latent variables) are treated the same, i.e. asrandom variables. Estimates for the variables can then be derived in the standard Bayesian ways, e.g. calculating the mean of the distribution to get a singlepoint estimate or deriving acredible interval, highest density region, etc.
  2. "Analytical approximation" means that a formula can be written down for the posterior distribution. The formula generally consists of a product of well-known probability distributions, each of whichfactorizes over a set of unobserved variables (i.e. it isconditionally independent of the other variables, given the observed data). This formula is not the true posterior distribution, but an approximation to it; in particular, it will generally agree fairly closely in the lowestmoments of the unobserved variables, e.g. themean andvariance.
  3. The result of all of the mathematical manipulations is (1) the identity of the probability distributions making up the factors, and (2) mutually dependent formulas for the parameters of these distributions. The actual values of these parameters are computed numerically, through an alternating iterative procedure much like EM.

Compared with expectation–maximization (EM)

[edit]

Variational Bayes (VB) is often compared withexpectation–maximization (EM). The actual numerical procedure is quite similar, in that both are alternating iterative procedures that successively converge on optimum parameter values. The initial steps to derive the respective procedures are also vaguely similar, both starting out with formulas for probability densities and both involving significant amounts of mathematical manipulations.

However, there are a number of differences. Most important iswhat is being computed.

  • EM computes point estimates of posterior distribution of those random variables that can be categorized as "parameters", but only estimates of the actual posterior distributions of the latent variables (at least in "soft EM", and often only when the latent variables are discrete). The point estimates computed are themodes of these parameters; no other information is available.
  • VB, on the other hand, computes estimates of the actual posterior distribution of all variables, both parameters and latent variables. When point estimates need to be derived, generally themean is used rather than the mode, as is normal in Bayesian inference. Concomitant with this, the parameters computed in VB donot have the same significance as those in EM. EM computes optimum values of the parameters of the Bayes network itself. VB computes optimum values of the parameters of the distributions used to approximate the parameters and latent variables of the Bayes network. For example, a typical Gaussianmixture model will have parameters for the mean and variance of each of the mixture components. EM would directly estimate optimum values for these parameters. VB, however, would first fit a distribution to these parameters — typically in the form of aprior distribution, e.g. anormal-scaled inverse gamma distribution — and would then compute values for the parameters of this prior distribution, i.e. essentiallyhyperparameters. In this case, VB would compute optimum estimates of the four parameters of the normal-scaled inverse gamma distribution that describes the joint distribution of the mean and variance of the component.

A more complex example

[edit]
Bayesian Gaussian mixture model usingplate notation. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of sizeK; [D,D] means a matrix of sizeD×D;K alone means acategorical variable withK outcomes. The squiggly line coming fromz ending in a crossbar indicates aswitch — the value of this variable selects, for the other incoming variables, which value to use out of the size-K array of possible values.

Imagine a BayesianGaussian mixture model described as follows:[3]

πSymDir(K,α0)Λi=1KW(W0,ν0)μi=1KN(μ0,(β0Λi)1)z[i=1N]Mult(1,π)xi=1NN(μzi,Λzi1)K=number of mixing componentsN=number of data points{\displaystyle {\begin{aligned}\mathbf {\pi } &\sim \operatorname {SymDir} (K,\alpha _{0})\\\mathbf {\Lambda } _{i=1\dots K}&\sim {\mathcal {W}}(\mathbf {W} _{0},\nu _{0})\\\mathbf {\mu } _{i=1\dots K}&\sim {\mathcal {N}}(\mathbf {\mu } _{0},(\beta _{0}\mathbf {\Lambda } _{i})^{-1})\\\mathbf {z} [i=1\dots N]&\sim \operatorname {Mult} (1,\mathbf {\pi } )\\\mathbf {x} _{i=1\dots N}&\sim {\mathcal {N}}(\mathbf {\mu } _{z_{i}},{\mathbf {\Lambda } _{z_{i}}}^{-1})\\K&={\text{number of mixing components}}\\N&={\text{number of data points}}\end{aligned}}}

Note:

The interpretation of the above variables is as follows:

The joint probability of all variables can be rewritten as

p(X,Z,π,μ,Λ)=p(XZ,μ,Λ)p(Zπ)p(π)p(μΛ)p(Λ){\displaystyle p(\mathbf {X} ,\mathbf {Z} ,\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } )=p(\mathbf {X} \mid \mathbf {Z} ,\mathbf {\mu } ,\mathbf {\Lambda } )p(\mathbf {Z} \mid \mathbf {\pi } )p(\mathbf {\pi } )p(\mathbf {\mu } \mid \mathbf {\Lambda } )p(\mathbf {\Lambda } )}

where the individual factors are

p(XZ,μ,Λ)=n=1Nk=1KN(xnμk,Λk1)znkp(Zπ)=n=1Nk=1Kπkznkp(π)=Γ(Kα0)Γ(α0)Kk=1Kπkα01p(μΛ)=k=1KN(μkμ0,(β0Λk)1)p(Λ)=k=1KW(ΛkW0,ν0){\displaystyle {\begin{aligned}p(\mathbf {X} \mid \mathbf {Z} ,\mathbf {\mu } ,\mathbf {\Lambda } )&=\prod _{n=1}^{N}\prod _{k=1}^{K}{\mathcal {N}}(\mathbf {x} _{n}\mid \mathbf {\mu } _{k},\mathbf {\Lambda } _{k}^{-1})^{z_{nk}}\\p(\mathbf {Z} \mid \mathbf {\pi } )&=\prod _{n=1}^{N}\prod _{k=1}^{K}\pi _{k}^{z_{nk}}\\p(\mathbf {\pi } )&={\frac {\Gamma (K\alpha _{0})}{\Gamma (\alpha _{0})^{K}}}\prod _{k=1}^{K}\pi _{k}^{\alpha _{0}-1}\\p(\mathbf {\mu } \mid \mathbf {\Lambda } )&=\prod _{k=1}^{K}{\mathcal {N}}(\mathbf {\mu } _{k}\mid \mathbf {\mu } _{0},(\beta _{0}\mathbf {\Lambda } _{k})^{-1})\\p(\mathbf {\Lambda } )&=\prod _{k=1}^{K}{\mathcal {W}}(\mathbf {\Lambda } _{k}\mid \mathbf {W} _{0},\nu _{0})\end{aligned}}}

where

N(xμ,Σ)=1(2π)D/21|Σ|1/2exp{12(xμ)TΣ1(xμ)}W(ΛW,ν)=B(W,ν)|Λ|(νD1)/2exp(12Tr(W1Λ))B(W,ν)=|W|ν/2{2νD/2πD(D1)/4i=1DΓ(ν+1i2)}1D=dimensionality of each data point{\displaystyle {\begin{aligned}{\mathcal {N}}(\mathbf {x} \mid \mathbf {\mu } ,\mathbf {\Sigma } )&={\frac {1}{(2\pi )^{D/2}}}{\frac {1}{|\mathbf {\Sigma } |^{1/2}}}\exp \left\{-{\frac {1}{2}}(\mathbf {x} -\mathbf {\mu } )^{\rm {T}}\mathbf {\Sigma } ^{-1}(\mathbf {x} -\mathbf {\mu } )\right\}\\{\mathcal {W}}(\mathbf {\Lambda } \mid \mathbf {W} ,\nu )&=B(\mathbf {W} ,\nu )|\mathbf {\Lambda } |^{(\nu -D-1)/2}\exp \left(-{\frac {1}{2}}\operatorname {Tr} (\mathbf {W} ^{-1}\mathbf {\Lambda } )\right)\\B(\mathbf {W} ,\nu )&=|\mathbf {W} |^{-\nu /2}\left\{2^{\nu D/2}\pi ^{D(D-1)/4}\prod _{i=1}^{D}\Gamma \left({\frac {\nu +1-i}{2}}\right)\right\}^{-1}\\D&={\text{dimensionality of each data point}}\end{aligned}}}

Assume thatq(Z,π,μ,Λ)=q(Z)q(π,μ,Λ){\displaystyle q(\mathbf {Z} ,\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } )=q(\mathbf {Z} )q(\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } )}.

Then[3]

lnq(Z)=Eπ,μ,Λ[lnp(X,Z,π,μ,Λ)]+constant=Eπ[lnp(Zπ)]+Eμ,Λ[lnp(XZ,μ,Λ)]+constant=n=1Nk=1Kznklnρnk+constant{\displaystyle {\begin{aligned}\ln q^{*}(\mathbf {Z} )&=\operatorname {E} _{\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } }[\ln p(\mathbf {X} ,\mathbf {Z} ,\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } )]+{\text{constant}}\\&=\operatorname {E} _{\mathbf {\pi } }[\ln p(\mathbf {Z} \mid \mathbf {\pi } )]+\operatorname {E} _{\mathbf {\mu } ,\mathbf {\Lambda } }[\ln p(\mathbf {X} \mid \mathbf {Z} ,\mathbf {\mu } ,\mathbf {\Lambda } )]+{\text{constant}}\\&=\sum _{n=1}^{N}\sum _{k=1}^{K}z_{nk}\ln \rho _{nk}+{\text{constant}}\end{aligned}}}

where we have defined

lnρnk=E[lnπk]+12E[ln|Λk|]D2ln(2π)12Eμk,Λk[(xnμk)TΛk(xnμk)]{\displaystyle \ln \rho _{nk}=\operatorname {E} [\ln \pi _{k}]+{\frac {1}{2}}\operatorname {E} [\ln |\mathbf {\Lambda } _{k}|]-{\frac {D}{2}}\ln(2\pi )-{\frac {1}{2}}\operatorname {E} _{\mathbf {\mu } _{k},\mathbf {\Lambda } _{k}}[(\mathbf {x} _{n}-\mathbf {\mu } _{k})^{\rm {T}}\mathbf {\Lambda } _{k}(\mathbf {x} _{n}-\mathbf {\mu } _{k})]}

Exponentiating both sides of the formula forlnq(Z){\displaystyle \ln q^{*}(\mathbf {Z} )} yields

q(Z)n=1Nk=1Kρnkznk{\displaystyle q^{*}(\mathbf {Z} )\propto \prod _{n=1}^{N}\prod _{k=1}^{K}\rho _{nk}^{z_{nk}}}

Requiring that this be normalized ends up requiring that theρnk{\displaystyle \rho _{nk}} sum to 1 over all values ofk{\displaystyle k}, yielding

q(Z)=n=1Nk=1Krnkznk{\displaystyle q^{*}(\mathbf {Z} )=\prod _{n=1}^{N}\prod _{k=1}^{K}r_{nk}^{z_{nk}}}

where

rnk=ρnkj=1Kρnj{\displaystyle r_{nk}={\frac {\rho _{nk}}{\sum _{j=1}^{K}\rho _{nj}}}}

In other words,q(Z){\displaystyle q^{*}(\mathbf {Z} )} is a product of single-observationmultinomial distributions, and factors over each individualzn{\displaystyle \mathbf {z} _{n}}, which is distributed as a single-observation multinomial distribution with parametersrnk{\displaystyle r_{nk}} fork=1K{\displaystyle k=1\dots K}.

Furthermore, we note that

E[znk]=rnk{\displaystyle \operatorname {E} [z_{nk}]=r_{nk}\,}

which is a standard result for categorical distributions.

Now, considering the factorq(π,μ,Λ){\displaystyle q(\mathbf {\pi } ,\mathbf {\mu } ,\mathbf {\Lambda } )}, note that it automatically factors intoq(π)k=1Kq(μk,Λk){\displaystyle q(\mathbf {\pi } )\prod _{k=1}^{K}q(\mathbf {\mu } _{k},\mathbf {\Lambda } _{k})} due to the structure of the graphical model defining our Gaussian mixture model, which is specified above.

Then,

lnq(π)=lnp(π)+EZ[lnp(Zπ)]+constant=(α01)k=1Klnπk+n=1Nk=1Krnklnπk+constant{\displaystyle {\begin{aligned}\ln q^{*}(\mathbf {\pi } )&=\ln p(\mathbf {\pi } )+\operatorname {E} _{\mathbf {Z} }[\ln p(\mathbf {Z} \mid \mathbf {\pi } )]+{\text{constant}}\\&=(\alpha _{0}-1)\sum _{k=1}^{K}\ln \pi _{k}+\sum _{n=1}^{N}\sum _{k=1}^{K}r_{nk}\ln \pi _{k}+{\text{constant}}\end{aligned}}}

Taking the exponential of both sides, we recognizeq(π){\displaystyle q^{*}(\mathbf {\pi } )} as aDirichlet distribution

q(π)Dir(α){\displaystyle q^{*}(\mathbf {\pi } )\sim \operatorname {Dir} (\mathbf {\alpha } )\,}

where

αk=α0+Nk{\displaystyle \alpha _{k}=\alpha _{0}+N_{k}\,}

where

Nk=n=1Nrnk{\displaystyle N_{k}=\sum _{n=1}^{N}r_{nk}\,}

Finally

lnq(μk,Λk)=lnp(μk,Λk)+n=1NE[znk]lnN(xnμk,Λk1)+constant{\displaystyle \ln q^{*}(\mathbf {\mu } _{k},\mathbf {\Lambda } _{k})=\ln p(\mathbf {\mu } _{k},\mathbf {\Lambda } _{k})+\sum _{n=1}^{N}\operatorname {E} [z_{nk}]\ln {\mathcal {N}}(\mathbf {x} _{n}\mid \mathbf {\mu } _{k},\mathbf {\Lambda } _{k}^{-1})+{\text{constant}}}

Grouping and reading off terms involvingμk{\displaystyle \mathbf {\mu } _{k}} andΛk{\displaystyle \mathbf {\Lambda } _{k}}, the result is aGaussian-Wishart distribution given by

q(μk,Λk)=N(μkmk,(βkΛk)1)W(ΛkWk,νk){\displaystyle q^{*}(\mathbf {\mu } _{k},\mathbf {\Lambda } _{k})={\mathcal {N}}(\mathbf {\mu } _{k}\mid \mathbf {m} _{k},(\beta _{k}\mathbf {\Lambda } _{k})^{-1}){\mathcal {W}}(\mathbf {\Lambda } _{k}\mid \mathbf {W} _{k},\nu _{k})}

given the definitions

βk=β0+Nkmk=1βk(β0μ0+Nkx¯k)Wk1=W01+NkSk+β0Nkβ0+Nk(x¯kμ0)(x¯kμ0)Tνk=ν0+NkNk=n=1Nrnkx¯k=1Nkn=1NrnkxnSk=1Nkn=1Nrnk(xnx¯k)(xnx¯k)T{\displaystyle {\begin{aligned}\beta _{k}&=\beta _{0}+N_{k}\\\mathbf {m} _{k}&={\frac {1}{\beta _{k}}}(\beta _{0}\mathbf {\mu } _{0}+N_{k}{\bar {\mathbf {x} }}_{k})\\\mathbf {W} _{k}^{-1}&=\mathbf {W} _{0}^{-1}+N_{k}\mathbf {S} _{k}+{\frac {\beta _{0}N_{k}}{\beta _{0}+N_{k}}}({\bar {\mathbf {x} }}_{k}-\mathbf {\mu } _{0})({\bar {\mathbf {x} }}_{k}-\mathbf {\mu } _{0})^{\rm {T}}\\\nu _{k}&=\nu _{0}+N_{k}\\N_{k}&=\sum _{n=1}^{N}r_{nk}\\{\bar {\mathbf {x} }}_{k}&={\frac {1}{N_{k}}}\sum _{n=1}^{N}r_{nk}\mathbf {x} _{n}\\\mathbf {S} _{k}&={\frac {1}{N_{k}}}\sum _{n=1}^{N}r_{nk}(\mathbf {x} _{n}-{\bar {\mathbf {x} }}_{k})(\mathbf {x} _{n}-{\bar {\mathbf {x} }}_{k})^{\rm {T}}\end{aligned}}}

Finally, notice that these functions require the values ofrnk{\displaystyle r_{nk}}, which make use ofρnk{\displaystyle \rho _{nk}}, which is defined in turn based onE[lnπk]{\displaystyle \operatorname {E} [\ln \pi _{k}]},E[ln|Λk|]{\displaystyle \operatorname {E} [\ln |\mathbf {\Lambda } _{k}|]}, andEμk,Λk[(xnμk)TΛk(xnμk)]{\displaystyle \operatorname {E} _{\mathbf {\mu } _{k},\mathbf {\Lambda } _{k}}[(\mathbf {x} _{n}-\mathbf {\mu } _{k})^{\rm {T}}\mathbf {\Lambda } _{k}(\mathbf {x} _{n}-\mathbf {\mu } _{k})]}. Now that we have determined the distributions over which these expectations are taken, we can derive formulas for them:

Eμk,Λk[(xnμk)TΛk(xnμk)]=Dβk1+νk(xnmk)TWk(xnmk)lnΛ~kE[ln|Λk|]=i=1Dψ(νk+1i2)+Dln2+ln|Wk|lnπ~kE[ln|πk|]=ψ(αk)ψ(i=1Kαi){\displaystyle {\begin{aligned}\operatorname {E} _{\mathbf {\mu } _{k},\mathbf {\Lambda } _{k}}[(\mathbf {x} _{n}-\mathbf {\mu } _{k})^{\rm {T}}\mathbf {\Lambda } _{k}(\mathbf {x} _{n}-\mathbf {\mu } _{k})]&=D\beta _{k}^{-1}+\nu _{k}(\mathbf {x} _{n}-\mathbf {m} _{k})^{\rm {T}}\mathbf {W} _{k}(\mathbf {x} _{n}-\mathbf {m} _{k})\\\ln {\widetilde {\Lambda }}_{k}&\equiv \operatorname {E} [\ln |\mathbf {\Lambda } _{k}|]=\sum _{i=1}^{D}\psi \left({\frac {\nu _{k}+1-i}{2}}\right)+D\ln 2+\ln |\mathbf {W} _{k}|\\\ln {\widetilde {\pi }}_{k}&\equiv \operatorname {E} \left[\ln |\pi _{k}|\right]=\psi (\alpha _{k})-\psi \left(\sum _{i=1}^{K}\alpha _{i}\right)\end{aligned}}}

These results lead to

rnkπ~kΛ~k1/2exp{D2βkνk2(xnmk)TWk(xnmk)}{\displaystyle r_{nk}\propto {\widetilde {\pi }}_{k}{\widetilde {\Lambda }}_{k}^{1/2}\exp \left\{-{\frac {D}{2\beta _{k}}}-{\frac {\nu _{k}}{2}}(\mathbf {x} _{n}-\mathbf {m} _{k})^{\rm {T}}\mathbf {W} _{k}(\mathbf {x} _{n}-\mathbf {m} _{k})\right\}}

These can be converted from proportional to absolute values by normalizing overk{\displaystyle k} so that the corresponding values sum to 1.

Note that:

  1. The update equations for the parametersβk{\displaystyle \beta _{k}},mk{\displaystyle \mathbf {m} _{k}},Wk{\displaystyle \mathbf {W} _{k}} andνk{\displaystyle \nu _{k}} of the variablesμk{\displaystyle \mathbf {\mu } _{k}} andΛk{\displaystyle \mathbf {\Lambda } _{k}} depend on the statisticsNk{\displaystyle N_{k}},x¯k{\displaystyle {\bar {\mathbf {x} }}_{k}}, andSk{\displaystyle \mathbf {S} _{k}}, and these statistics in turn depend onrnk{\displaystyle r_{nk}}.
  2. The update equations for the parametersα1K{\displaystyle \alpha _{1\dots K}} of the variableπ{\displaystyle \mathbf {\pi } } depend on the statisticNk{\displaystyle N_{k}}, which depends in turn onrnk{\displaystyle r_{nk}}.
  3. The update equation forrnk{\displaystyle r_{nk}} has a direct circular dependence onβk{\displaystyle \beta _{k}},mk{\displaystyle \mathbf {m} _{k}},Wk{\displaystyle \mathbf {W} _{k}} andνk{\displaystyle \nu _{k}} as well as an indirect circular dependence onWk{\displaystyle \mathbf {W} _{k}},νk{\displaystyle \nu _{k}} andα1K{\displaystyle \alpha _{1\dots K}} throughπ~k{\displaystyle {\widetilde {\pi }}_{k}} andΛ~k{\displaystyle {\widetilde {\Lambda }}_{k}}.

This suggests an iterative procedure that alternates between two steps:

  1. An E-step that computes the value ofrnk{\displaystyle r_{nk}} using the current values of all the other parameters.
  2. An M-step that uses the new value ofrnk{\displaystyle r_{nk}} to compute new values of all the other parameters.

Note that these steps correspond closely with the standard EM algorithm to derive amaximum likelihood ormaximum a posteriori (MAP) solution for the parameters of aGaussian mixture model. The responsibilitiesrnk{\displaystyle r_{nk}} in the E step correspond closely to theposterior probabilities of the latent variables given the data, i.e.p(ZX){\displaystyle p(\mathbf {Z} \mid \mathbf {X} )}; the computation of the statisticsNk{\displaystyle N_{k}},x¯k{\displaystyle {\bar {\mathbf {x} }}_{k}}, andSk{\displaystyle \mathbf {S} _{k}} corresponds closely to the computation of corresponding "soft-count" statistics over the data; and the use of those statistics to compute new values of the parameters corresponds closely to the use of soft counts to compute new parameter values in normal EM over a Gaussian mixture model.

Exponential-family distributions

[edit]

Note that in the previous example, once the distribution over unobserved variables was assumed to factorize into distributions over the "parameters" and distributions over the "latent data", the derived "best" distribution for each variable was in the same family as the corresponding prior distribution over the variable. This is a general result that holds true for all prior distributions derived from theexponential family.

See also

[edit]

References

[edit]
  1. ^abcdTran, Viet Hung (2018). "Copula Variational Bayes inference via information geometry".arXiv:1803.10998 [cs.IT].
  2. ^abAdamčík, Martin (2014)."The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning".Entropy.16 (12):6338–6381.Bibcode:2014Entrp..16.6338A.doi:10.3390/e16126338.
  3. ^abcNguyen, Duy (15 August 2023)."An in Depth Introduction to Variational Bayes Note".doi:10.2139/ssrn.4541076.SSRN 4541076. Retrieved15 August 2023.
  4. ^abcLee, Se Yoon (2021). "Gibbs sampler and coordinate ascent variational inference: A set-theoretical review".Communications in Statistics - Theory and Methods.51 (6):1–21.arXiv:2008.01006.doi:10.1080/03610926.2021.1921214.S2CID 220935477.
  5. ^Boyd, Stephen P.; Vandenberghe, Lieven (2004).Convex Optimization(PDF). Cambridge University Press.ISBN 978-0-521-83378-3. RetrievedOctober 15, 2011.
  6. ^Bishop, Christopher M. (2006). "Chapter 10".Pattern Recognition and Machine Learning. Springer.ISBN 978-0-387-31073-2.
  7. ^Sotirios P. Chatzis, “Infinite Markov-Switching Maximum Entropy Discrimination Machines,” Proc. 30th International Conference on Machine Learning (ICML). Journal of Machine Learning Research: Workshop and Conference Proceedings, vol. 28, no. 3, pp. 729–737, June 2013.

External links

[edit]
Retrieved from "https://en.wikipedia.org/w/index.php?title=Variational_Bayesian_methods&oldid=1318931229"
Category:
Hidden categories:

[8]ページ先頭

©2009-2025 Movatter.jp