
Dirichlet-multinomial distribution

Dirichlet-multinomial

Notation: \mathrm{DirMult}(n, \boldsymbol{\alpha})
Parameters: n \in \{0, 1, 2, \ldots\} (number of trials); \alpha_1, \ldots, \alpha_K > 0, with \alpha_0 = \sum_k \alpha_k
Support: x_i \in \{0, \dots, n\} with \sum_i x_i = n, for 1 \leq i \leq K
PMF: \frac{\Gamma(\alpha_0)\,\Gamma(n+1)}{\Gamma(n+\alpha_0)} \prod_{k=1}^{K} \frac{\Gamma(x_k + \alpha_k)}{\Gamma(\alpha_k)\,\Gamma(x_k + 1)}[1]
Mean: \operatorname{E}(X_i) = n \frac{\alpha_i}{\alpha_0}
Variance: \operatorname{Var}(X_i) = n \frac{\alpha_i}{\alpha_0} \left(1 - \frac{\alpha_i}{\alpha_0}\right) \left(\frac{n + \alpha_0}{1 + \alpha_0}\right), \quad \operatorname{Cov}(X_i, X_k) = -n \frac{\alpha_i}{\alpha_0} \frac{\alpha_k}{\alpha_0} \left(\frac{n + \alpha_0}{1 + \alpha_0}\right) \ (i \neq k)
MGF: \operatorname{E}\left(\prod_{k=1}^{K} e^{t_k x_k}\right) = \frac{\Gamma(\alpha_0)\,\Gamma(n+1)}{\Gamma(n+\alpha_0)} \, D_n\left(\boldsymbol{\alpha}, (e^{t_1}, \ldots, e^{t_K})\right), with D_n = \frac{1}{n} \sum_{u=1}^{n} \left[\left(\sum_{k=1}^{K} \alpha_k e^{t_k u}\right) D_{n-u}\right], \ D_0 = 1[1]
CF: \operatorname{E}\left(\prod_{k=1}^{K} e^{i t_k x_k}\right) = \frac{\Gamma(\alpha_0)\,\Gamma(n+1)}{\Gamma(n+\alpha_0)} \, D_n\left(\boldsymbol{\alpha}, (e^{i t_1}, \ldots, e^{i t_K})\right), with D_n = \frac{1}{n} \sum_{u=1}^{n} \left[\left(\sum_{k=1}^{K} \alpha_k e^{i t_k u}\right) D_{n-u}\right], \ D_0 = 1[1]
PGF: \operatorname{E}\left(\prod_{k=1}^{K} z_k^{x_k}\right) = \frac{\Gamma(\alpha_0)\,\Gamma(n+1)}{\Gamma(n+\alpha_0)} \, D_n(\boldsymbol{\alpha}, \mathbf{z}), with D_n = \frac{1}{n} \sum_{u=1}^{n} \left[\left(\sum_{k=1}^{K} \alpha_k z_k^{u}\right) D_{n-u}\right], \ D_0 = 1[1]

In probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution (after George Pólya). It is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector α, and an observation drawn from a multinomial distribution with probability vector p and number of trials n. The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data is collected. The compounding corresponds to a Pólya urn scheme. It is frequently encountered in Bayesian statistics, machine learning, empirical Bayes methods and classical statistics as an overdispersed multinomial distribution.

It reduces to the categorical distribution as a special case when n = 1. It also approximates the multinomial distribution arbitrarily well for large α. The Dirichlet-multinomial is a multivariate extension of the beta-binomial distribution, as the multinomial and Dirichlet distributions are multivariate versions of the binomial and beta distributions, respectively.

Specification


Dirichlet-multinomial as a compound distribution


The Dirichlet distribution is a conjugate distribution to the multinomial distribution. This fact leads to an analytically tractable compound distribution. For a random vector of category counts x = (x_1, …, x_K), distributed according to a multinomial distribution, the marginal distribution is obtained by integrating over the distribution for p, which can be thought of as a random vector following a Dirichlet distribution:

\Pr(\mathbf{x} \mid n, \boldsymbol{\alpha}) = \int_{\mathbf{p}} \mathrm{Mult}(\mathbf{x} \mid n, \mathbf{p}) \, \mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha}) \, \mathrm{d}\mathbf{p}

which results in the following explicit formula:

\Pr(\mathbf{x} \mid n, \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)\,\Gamma(n+1)}{\Gamma(n+\alpha_0)} \prod_{k=1}^{K} \frac{\Gamma(x_k + \alpha_k)}{\Gamma(\alpha_k)\,\Gamma(x_k + 1)}

where α_0 is defined as the sum α_0 = \sum_k \alpha_k. Another form for this same compound distribution, written more compactly in terms of the beta function, B, is as follows:

\Pr(\mathbf{x} \mid n, \boldsymbol{\alpha}) = \frac{n B(\alpha_0, n)}{\prod_{k : x_k > 0} x_k B(\alpha_k, x_k)}.

The latter form emphasizes the fact that zero-count categories can be ignored in the calculation, a useful fact when the number of categories is very large and sparse (e.g. word counts in documents).
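
In practice the PMF is evaluated on the log scale, since the gamma factors overflow quickly even for moderate counts. The following is a minimal sketch in Python; the function name dirmult_logpmf and the example values are illustrative, not part of any standard library.

```python
from math import lgamma

def dirmult_logpmf(x, alpha):
    """Log-PMF of the Dirichlet-multinomial for a count vector x and parameter vector alpha."""
    n = sum(x)
    a0 = sum(alpha)
    # log Gamma(alpha_0) + log Gamma(n + 1) - log Gamma(n + alpha_0)
    out = lgamma(a0) + lgamma(n + 1) - lgamma(n + a0)
    # Product over categories; zero-count categories contribute exactly 0 and can be skipped,
    # which mirrors the sparse beta-function form above.
    for xk, ak in zip(x, alpha):
        if xk > 0:
            out += lgamma(xk + ak) - lgamma(ak) - lgamma(xk + 1)
    return out

# Example: n = 10 trials over K = 3 categories with a symmetric prior
print(dirmult_logpmf([7, 2, 1], [1.0, 1.0, 1.0]))
```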

Observe that the PMF reduces to the beta-binomial distribution when K = 2. It can also be shown that it approaches the multinomial distribution as α_0 approaches infinity. The parameter α_0 governs the degree of overdispersion or burstiness relative to the multinomial. Alternative symbols for α_0 found in the literature are S and A.

Dirichlet-multinomial as an urn model


The Dirichlet-multinomial distribution can also be motivated via an urn model for positive integer values of the vector α, known as the Pólya urn model. Specifically, imagine an urn containing balls of K colors, with α_i balls of the i-th color, from which random draws are made. When a ball is randomly drawn and observed, two balls of the same color are returned to the urn. If this is performed n times, then the probability of observing the random vector x of color counts is a Dirichlet-multinomial with parameters n and α. If the random draws are made with simple replacement (no balls over and above the observed ball are added to the urn), then the distribution follows a multinomial distribution, and if the random draws are made without replacement, the distribution follows a multivariate hypergeometric distribution.
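
The urn interpretation can be checked by simulation: drawing from the Pólya urn and drawing by compounding a Dirichlet with a multinomial produce the same distribution of counts. A rough sketch using NumPy (function names and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def polya_urn_draw(n, alpha):
    """Color counts from n Polya-urn draws: each drawn ball is returned together with a copy."""
    balls = np.array(alpha, dtype=float)          # current number of balls of each color
    counts = np.zeros(len(alpha), dtype=int)
    for _ in range(n):
        k = rng.choice(len(alpha), p=balls / balls.sum())
        counts[k] += 1
        balls[k] += 1                             # one extra ball of the drawn color
    return counts

def compound_draw(n, alpha):
    """Counts from first sampling p ~ Dirichlet(alpha), then x ~ Multinomial(n, p)."""
    return rng.multinomial(n, rng.dirichlet(alpha))

alpha, n, reps = [2, 1, 1], 5, 20000
urn_mean = np.mean([polya_urn_draw(n, alpha) for _ in range(reps)], axis=0)
compound_mean = np.mean([compound_draw(n, alpha) for _ in range(reps)], axis=0)
print(urn_mean, compound_mean)   # both close to n * alpha / sum(alpha) = [2.5, 1.25, 1.25]
```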

Properties


Moments


Once again, let α_0 = \sum_k \alpha_k and let p_i = \alpha_i / \alpha_0. Then the expected number of times the outcome i was observed over n trials is

\operatorname{E}(X_i) = n p_i = n \frac{\alpha_i}{\alpha_0}.

The covariance matrix is as follows. Each diagonal entry is the variance of a beta-binomially distributed random variable, and is therefore

\operatorname{var}(X_i) = n p_i (1 - p_i) \left( \frac{n + \alpha_0}{1 + \alpha_0} \right) = n \frac{\alpha_i}{\alpha_0} \left( 1 - \frac{\alpha_i}{\alpha_0} \right) \left( \frac{n + \alpha_0}{1 + \alpha_0} \right).

The off-diagonal entries are the covariances:

\operatorname{cov}(X_i, X_j) = -n p_i p_j \left( \frac{n + \alpha_0}{1 + \alpha_0} \right) = -n \frac{\alpha_i \alpha_j}{\alpha_0^2} \left( \frac{n + \alpha_0}{1 + \alpha_0} \right)

for i, j distinct.

All covariances are negative because for fixedn, an increase in one component of a Dirichlet-multinomial vector requires a decrease in another component.

This is a K × K positive-semidefinite matrix of rank K − 1.

The entries of the corresponding correlation matrix are

\rho(X_i, X_i) = 1.

\rho(X_i, X_j) = \frac{\operatorname{cov}(X_i, X_j)}{\sqrt{\operatorname{var}(X_i) \operatorname{var}(X_j)}} = \frac{-p_i p_j \left( \frac{n + \alpha_0}{1 + \alpha_0} \right)}{\sqrt{p_i (1 - p_i) \left( \frac{n + \alpha_0}{1 + \alpha_0} \right) \, p_j (1 - p_j) \left( \frac{n + \alpha_0}{1 + \alpha_0} \right)}} = -\sqrt{\frac{\alpha_i \alpha_j}{(\alpha_0 - \alpha_i)(\alpha_0 - \alpha_j)}}.

The sample size drops out of this expression.

Each of the K components separately has a beta-binomial distribution.

The support of the Dirichlet-multinomial distribution is the set

\{(n_1, \dots, n_K) \in \mathbb{N}^K \mid n_1 + \cdots + n_K = n\}.

Its number of elements is

\binom{n + K - 1}{K - 1}.
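
These formulas can be sanity-checked by enumerating the (fairly small) support for a toy case and computing the exact moments. A sketch in Python, reusing the same illustrative log-gamma evaluation of the PMF as above:

```python
from math import lgamma, exp, comb
from itertools import product

def dirmult_pmf(x, alpha):
    n, a0 = sum(x), sum(alpha)
    logp = lgamma(a0) + lgamma(n + 1) - lgamma(n + a0)
    logp += sum(lgamma(xk + ak) - lgamma(ak) - lgamma(xk + 1) for xk, ak in zip(x, alpha))
    return exp(logp)

n, alpha = 4, [1.5, 2.0, 0.5]
K, a0 = len(alpha), sum(alpha)

# Enumerate the support {x : sum(x) = n}; its size is C(n + K - 1, K - 1)
support = [x for x in product(range(n + 1), repeat=K) if sum(x) == n]
assert len(support) == comb(n + K - 1, K - 1)

total = sum(dirmult_pmf(x, alpha) for x in support)
mean_0 = sum(x[0] * dirmult_pmf(x, alpha) for x in support)
var_0 = sum((x[0] - mean_0) ** 2 * dirmult_pmf(x, alpha) for x in support)

print(total)                                        # ~1.0
print(mean_0, n * alpha[0] / a0)                    # exact mean vs. formula
print(var_0, n * (alpha[0] / a0) * (1 - alpha[0] / a0) * (n + a0) / (1 + a0))  # vs. formula
```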

Matrix notation


In matrix notation,

\operatorname{E}(\mathbf{X}) = n \mathbf{p},

and

\operatorname{var}(\mathbf{X}) = n \{ \operatorname{diag}(\mathbf{p}) - \mathbf{p} \mathbf{p}^{\mathsf{T}} \} \left( \frac{n + \alpha_0}{1 + \alpha_0} \right),

with p^T the row-vector transpose of the column vector p. Letting

\alpha_0 = \frac{1 - \rho^2}{\rho^2},

we can write alternatively

\operatorname{var}(\mathbf{X}) = n \{ \operatorname{diag}(\mathbf{p}) - \mathbf{p} \mathbf{p}^{\mathsf{T}} \} \left( 1 + \rho^2 (n - 1) \right).

The parameter ρ is known as the "intra class" or "intra cluster" correlation. It is this positive correlation which gives rise to overdispersion relative to the multinomial distribution.
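
A short numerical sketch of the matrix form, assuming NumPy, which also verifies that the two parameterizations of the scale factor agree:

```python
import numpy as np

alpha = np.array([1.5, 2.0, 0.5])
n = 4
a0 = alpha.sum()
p = alpha / a0

# Covariance matrix: n * (diag(p) - p p^T) * (n + a0) / (1 + a0)
cov = n * (np.diag(p) - np.outer(p, p)) * (n + a0) / (1 + a0)

# Equivalent form via the intraclass correlation: a0 = (1 - rho^2) / rho^2, i.e. rho^2 = 1 / (1 + a0)
rho2 = 1.0 / (1.0 + a0)
cov_alt = n * (np.diag(p) - np.outer(p, p)) * (1 + rho2 * (n - 1))

print(np.allclose(cov, cov_alt))        # True
print(np.linalg.matrix_rank(cov))       # K - 1 = 2, since the counts sum to n
```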

Aggregation


If

X = (X_1, \ldots, X_K) \sim \operatorname{DM}(\alpha_1, \ldots, \alpha_K)

then, if the random variables with subscripts i and j are dropped from the vector and replaced by their sum,

X' = (X_1, \ldots, X_i + X_j, \ldots, X_K) \sim \operatorname{DM}(\alpha_1, \ldots, \alpha_i + \alpha_j, \ldots, \alpha_K).

This aggregation property may be used to derive the marginal distribution of X_i.
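
The aggregation property can be verified exactly for a small case by summing the PMF over all ways of splitting the merged count. A sketch, using the same illustrative dirmult_pmf helper as above:

```python
from math import lgamma, exp

def dirmult_pmf(x, alpha):
    n, a0 = sum(x), sum(alpha)
    logp = lgamma(a0) + lgamma(n + 1) - lgamma(n + a0)
    logp += sum(lgamma(xk + ak) - lgamma(ak) - lgamma(xk + 1) for xk, ak in zip(x, alpha))
    return exp(logp)

n, alpha = 6, [1.0, 2.5, 0.7]

# Merging components 2 and 3: (X1, X2 + X3) should be DM(alpha1, alpha2 + alpha3)
for x1 in range(n + 1):
    lhs = sum(dirmult_pmf((x1, x2, n - x1 - x2), alpha) for x2 in range(n - x1 + 1))
    rhs = dirmult_pmf((x1, n - x1), [alpha[0], alpha[1] + alpha[2]])
    assert abs(lhs - rhs) < 1e-12
print("aggregation property holds for n =", n)
```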

Likelihood function


Conceptually, we are making N independent draws from a categorical distribution with K categories. Let us represent the independent draws as random categorical variables z_n for n = 1 … N. Let us denote the number of times a particular category k has been seen (for k = 1 … K) among all the categorical variables as n_k, with \sum_k n_k = N. Then we have two separate views onto this problem:

  1. A set of N categorical variables z_1, …, z_N.
  2. A single vector-valued variable x = (n_1, …, n_K), distributed according to a multinomial distribution.

The former case is a set of random variables specifying each individual outcome, while the latter is a variable specifying the number of outcomes of each of the K categories. The distinction is important, as the two cases have correspondingly different probability distributions.

The parameter of the categorical distribution is p = (p_1, p_2, …, p_K), where p_k is the probability of drawing value k; p is likewise the parameter of the multinomial distribution P(x | p). Rather than specifying p directly, we give it a conjugate prior distribution, and hence it is drawn from a Dirichlet distribution with parameter vector α = (α_1, α_2, …, α_K).

By integrating out p, we obtain a compound distribution. However, the form of the distribution is different depending on which view we take.

For a set of individual outcomes


Joint distribution


For categorical variables Z = z_1, …, z_N, the marginal joint distribution is obtained by integrating out p:

\Pr(\mathbb{Z} \mid \boldsymbol{\alpha}) = \int_{\mathbf{p}} \Pr(\mathbb{Z} \mid \mathbf{p}) \Pr(\mathbf{p} \mid \boldsymbol{\alpha}) \, \mathrm{d}\mathbf{p}

which results in the following explicit formula:

\Pr(\mathbb{Z} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(N + \alpha_0)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}

where Γ is the gamma function, with

\alpha_0 = \sum_k \alpha_k \text{ and } N = \sum_k n_k \text{, and where } n_k = \text{number of } z_n \text{'s with the value } k.

Note the absence of the multinomial coefficient: the formula gives the probability of a particular sequence of categorical variables rather than a probability over the counts within each category.

Although the variables z_1, …, z_N do not appear explicitly in the above formula, they enter through the n_k values.
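
Concretely, the probability of one particular sequence of outcomes depends only on the counts n_k, so any two sequences with the same counts have the same probability. A minimal sketch (names are illustrative):

```python
from math import lgamma, exp
from collections import Counter

def sequence_logprob(z, alpha):
    """Log-probability of a particular sequence z of categorical outcomes (values 0..K-1),
    with the probability vector p integrated out against Dirichlet(alpha)."""
    counts = Counter(z)
    N, a0 = len(z), sum(alpha)
    out = lgamma(a0) - lgamma(N + a0)
    for k, ak in enumerate(alpha):
        out += lgamma(counts.get(k, 0) + ak) - lgamma(ak)
    return out

alpha = [0.5, 1.0, 2.0]
# Two orderings with the same counts have the same probability (no multinomial coefficient)
print(exp(sequence_logprob([0, 1, 1, 2], alpha)), exp(sequence_logprob([1, 2, 1, 0], alpha)))
```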

Conditional distribution


Another useful formula, particularly in the context of Gibbs sampling, asks what the conditional density of a given variable z_n is, conditioned on all the other variables (which we will denote Z^(−n)). It turns out to have an extremely simple form:

\Pr(z_n = k \mid \mathbb{Z}^{(-n)}, \boldsymbol{\alpha}) \propto n_k^{(-n)} + \alpha_k

where n_k^(−n) denotes the number of counts of category k seen in all variables other than z_n.

It may be useful to show how to derive this formula. In general, conditional distributions are proportional to the corresponding joint distributions, so we simply start with the above formula for the joint distribution of all the z_1, …, z_N values and then eliminate any factors not dependent on the particular z_n in question. To do this, we make use of the notation n_k^(−n) defined above, and

n_j = \begin{cases} n_j^{(-n)}, & \text{if } j \neq k \\ n_j^{(-n)} + 1, & \text{if } j = k \end{cases}

We also use the fact that

\Gamma(n + 1) = n \Gamma(n)

Then:

\begin{aligned}
& \Pr(z_n = k \mid \mathbb{Z}^{(-n)}, \boldsymbol{\alpha}) \\
\propto\ & \Pr(z_n = k, \mathbb{Z}^{(-n)} \mid \boldsymbol{\alpha}) \\
=\ & \frac{\Gamma(\alpha_0)}{\Gamma(N + \alpha_0)} \prod_{j=1}^{K} \frac{\Gamma(n_j + \alpha_j)}{\Gamma(\alpha_j)} \\
\propto\ & \prod_{j=1}^{K} \Gamma(n_j + \alpha_j) \\
=\ & \Gamma(n_k + \alpha_k) \prod_{j \neq k} \Gamma(n_j + \alpha_j) \\
=\ & \Gamma(n_k^{(-n)} + 1 + \alpha_k) \prod_{j \neq k} \Gamma(n_j^{(-n)} + \alpha_j) \\
=\ & (n_k^{(-n)} + \alpha_k) \, \Gamma(n_k^{(-n)} + \alpha_k) \prod_{j \neq k} \Gamma(n_j^{(-n)} + \alpha_j) \\
=\ & (n_k^{(-n)} + \alpha_k) \prod_{j} \Gamma(n_j^{(-n)} + \alpha_j) \\
\propto\ & n_k^{(-n)} + \alpha_k
\end{aligned}

In general, it is not necessary to worry about the normalizing constant at the time of deriving the equations for conditional distributions. The normalizing constant will be determined as part of the algorithm for sampling from the distribution (see Categorical distribution#Sampling). However, when the conditional distribution is written in the simple form above, it turns out that the normalizing constant assumes a simple form:

\sum_k \left( n_k^{(-n)} + \alpha_k \right) = \alpha_0 + \sum_k n_k^{(-n)} = \alpha_0 + N - 1

Hence

\Pr(z_n = k \mid \mathbb{Z}^{(-n)}, \boldsymbol{\alpha}) = \frac{n_k^{(-n)} + \alpha_k}{\alpha_0 + N - 1}

This formula is closely related to the Chinese restaurant process, which results from taking the limit as K → ∞.
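
A collapsed Gibbs sampler uses this conditional directly: remove the current value of z_n from the counts, form probabilities proportional to n_k^(−n) + α_k, and sample a new value. A sketch assuming NumPy (the function name and example values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_sweep(z, counts, alpha):
    """One sweep of collapsed Gibbs sampling for categorical draws with the Dirichlet integrated out."""
    for n in range(len(z)):
        counts[z[n]] -= 1                    # remove z_n from the counts: n_k^(-n)
        probs = counts + alpha               # proportional to n_k^(-n) + alpha_k
        probs /= probs.sum()                 # the normalizer is alpha_0 + N - 1
        z[n] = rng.choice(len(alpha), p=probs)
        counts[z[n]] += 1                    # add the newly sampled value back in
    return z, counts

alpha = np.array([1.0, 1.0, 2.0])
z = rng.integers(0, 3, size=20)
counts = np.bincount(z, minlength=3).astype(float)
for _ in range(100):
    z, counts = gibbs_sweep(z, counts, alpha)
print(counts)
```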

In a Bayesian network


In a larger Bayesian network in which categorical (or so-called "multinomial") distributions occur with Dirichlet distribution priors as part of a larger network, all Dirichlet priors can be collapsed provided that the only nodes depending on them are categorical distributions. The collapsing happens for each Dirichlet-distribution node separately from the others, and occurs regardless of any other nodes that may depend on the categorical distributions. It also occurs regardless of whether the categorical distributions depend on nodes additional to the Dirichlet priors (although in such a case, those other nodes must remain as additional conditioning factors). Essentially, all of the categorical distributions depending on a given Dirichlet-distribution node become connected into a single Dirichlet-multinomial joint distribution defined by the above formula. The joint distribution as defined this way will depend on the parent(s) of the integrated-out Dirichlet prior nodes, as well as any parent(s) of the categorical nodes other than the Dirichlet prior nodes themselves.

In the following sections, we discuss different configurations commonly found in Bayesian networks. We repeat the probability density from above, and define it using the symbol \operatorname{DirMult}(\mathbb{Z} \mid \boldsymbol{\alpha}):

\Pr(\mathbb{Z} \mid \boldsymbol{\alpha}) = \operatorname{DirMult}(\mathbb{Z} \mid \boldsymbol{\alpha}) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma\left(\sum_k n_k + \alpha_k\right)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}
Multiple Dirichlet priors with the same hyperprior

Imagine we have a hierarchical model as follows:

\begin{array}{lcl}
\boldsymbol{\alpha} & \sim & \text{some distribution} \\
\boldsymbol{\theta}_{d=1 \dots M} & \sim & \operatorname{Dirichlet}_K(\boldsymbol{\alpha}) \\
z_{d=1 \dots M,\, n=1 \dots N_d} & \sim & \operatorname{Categorical}_K(\boldsymbol{\theta}_d)
\end{array}

In cases like this, we have multiple Dirichlet priors, each of which generates some number of categorical observations (possibly a different number for each prior). The fact that they are all dependent on the same hyperprior, even if this is a random variable as above, makes no difference. The effect of integrating out a Dirichlet prior links the categorical variables attached to that prior, whose joint distribution simply inherits any conditioning factors of the Dirichlet prior. The fact that multiple priors may share a hyperprior makes no difference:

\Pr(\mathbb{Z} \mid \boldsymbol{\alpha}) = \prod_d \operatorname{DirMult}(\mathbb{Z}_d \mid \boldsymbol{\alpha})

where Z_d is simply the collection of categorical variables dependent on prior d.

Accordingly, the conditional probability distribution can be written as follows:

\Pr(z_{dn} = k \mid \mathbb{Z}^{(-dn)}, \boldsymbol{\alpha}) \propto n_{k,d}^{(-n)} + \alpha_k

where n_{k,d}^(−n) specifically means the number of variables among the set Z_d, excluding z_dn itself, that have the value k.

It is necessary to count only the variables having the value k that are tied to the variable in question through having the same prior. We do not want to count any other variables also having the value k.

Multiple Dirichlet priors with the same hyperprior, with dependent children

Now imagine a slightly more complicated hierarchical model as follows:

\begin{array}{lcl}
\boldsymbol{\alpha} & \sim & \text{some distribution} \\
\boldsymbol{\theta}_{d=1 \dots M} & \sim & \operatorname{Dirichlet}_K(\boldsymbol{\alpha}) \\
z_{d=1 \dots M,\, n=1 \dots N_d} & \sim & \operatorname{Categorical}_K(\boldsymbol{\theta}_d) \\
\boldsymbol{\phi} & \sim & \text{some other distribution} \\
w_{d=1 \dots M,\, n=1 \dots N_d} & \sim & \operatorname{F}(w_{dn} \mid z_{dn}, \boldsymbol{\phi})
\end{array}

This model is the same as above, but in addition, each of the categorical variables has a child variable dependent on it. This is typical of a mixture model.

Again, in the joint distribution, only the categorical variables dependent on the same prior are linked into a single Dirichlet-multinomial:

\Pr(\mathbb{Z}, \mathbb{W} \mid \boldsymbol{\alpha}, \boldsymbol{\phi}) = \prod_d \operatorname{DirMult}(\mathbb{Z}_d \mid \boldsymbol{\alpha}) \prod_{d=1}^{M} \prod_{n=1}^{N_d} \operatorname{F}(w_{dn} \mid z_{dn}, \boldsymbol{\phi})

The conditional distribution of the categorical variables dependent only on their parents and ancestors would have the same form as above in the simpler case. However, in Gibbs sampling it is necessary to determine the conditional distribution of a given node z_dn dependent not only on Z^(−dn) and ancestors such as α but on all the other parameters.

The simplified expression for the conditional distribution is derived above simply by rewriting the expression for the joint probability and removing constant factors. Hence, the same simplification would apply in a larger joint probability expression such as the one in this model, composed of Dirichlet-multinomial densities plus factors for many other random variables dependent on the values of the categorical variables.

This yields the following:

\Pr(z_{dn} = k \mid \mathbb{Z}^{(-dn)}, \mathbb{W}, \boldsymbol{\alpha}, \boldsymbol{\phi}) \propto (n_{k,d}^{(-n)} + \alpha_k) \operatorname{F}(w_{dn} \mid z_{dn}, \boldsymbol{\phi})

Here the probability density of F appears directly. To do random sampling over z_dn, we would compute the unnormalized probabilities for all K possibilities for z_dn using the above formula, then normalize them and proceed as normal using the algorithm described in the categorical distribution article.

Strictly speaking, the additional factor that appears in the conditional distribution is derived not from the model specification but directly from the joint distribution. This distinction is important when considering models where a given node with a Dirichlet-prior parent has multiple dependent children, particularly when those children are dependent on each other (e.g. if they share a parent that is collapsed out). This is discussed more below.

Multiple Dirichlet priors with shifting prior membership

Now imagine we have a hierarchical model as follows:

\begin{array}{lcl}
\boldsymbol{\theta} & \sim & \text{some distribution} \\
z_{n=1 \dots N} & \sim & \operatorname{Categorical}_K(\boldsymbol{\theta}) \\
\boldsymbol{\alpha} & \sim & \text{some distribution} \\
\boldsymbol{\phi}_{k=1 \dots K} & \sim & \operatorname{Dirichlet}_V(\boldsymbol{\alpha}) \\
w_{n=1 \dots N} & \sim & \operatorname{Categorical}_V(\boldsymbol{\phi}_{z_n})
\end{array}

Here we have a tricky situation where we have multiple Dirichlet priors as before and a set of dependent categorical variables, but the relationship between the priors and dependent variables isn't fixed, unlike before. Instead, the choice of which prior to use is dependent on another random categorical variable. This occurs, for example, in topic models, and indeed the names of the variables above are meant to correspond to those in latent Dirichlet allocation. In this case, the set W is a set of words, each of which is drawn from one of K possible topics, where each topic is a Dirichlet prior over a vocabulary of V possible words, specifying the frequency of different words in the topic. However, the topic membership of a given word isn't fixed; rather, it's determined from a set of latent variables Z. There is one latent variable per word, a K-dimensional categorical variable specifying the topic the word belongs to.

In this case, all variables dependent on a given prior are tied together (i.e. correlated) in a group, as before; specifically, all words belonging to a given topic are linked. In this case, however, the group membership shifts, in that the words are not fixed to a given topic but the topic depends on the value of a latent variable associated with the word. However, the definition of the Dirichlet-multinomial density doesn't actually depend on the number of categorical variables in a group (i.e. the number of words in the document generated from a given topic), but only on the counts of how many variables in the group have a given value (i.e. among all the word tokens generated from a given topic, how many of them are a given word). Hence, we can still write an explicit formula for the joint distribution:

\Pr(\mathbb{W} \mid \boldsymbol{\alpha}, \mathbb{Z}) = \prod_{k=1}^{K} \operatorname{DirMult}(\mathbb{W}_k \mid \mathbb{Z}, \boldsymbol{\alpha}) = \prod_{k=1}^{K} \left[ \frac{\Gamma\left(\sum_v \alpha_v\right)}{\Gamma\left(\sum_v n_v^k + \alpha_v\right)} \prod_{v=1}^{V} \frac{\Gamma(n_v^k + \alpha_v)}{\Gamma(\alpha_v)} \right]

Here we use the notation n_v^k to denote the number of word tokens whose value is word symbol v and which belong to topic k.

The conditional distribution still has the same form:

\Pr(w_n = v \mid \mathbb{W}^{(-n)}, \mathbb{Z}, \boldsymbol{\alpha}) \propto n_v^{k, (-n)} + \alpha_v

Here again, only the categorical variables for words belonging to a given topic are linked (even though this linking will depend on the assignments of the latent variables), and hence the word counts need to be over only the words generated by a given topic. Hence the symbol n_v^{k,(−n)}, which is the count of word tokens having the word symbol v, but only among those generated by topic k, and excluding the word itself whose distribution is being described.

(The reason why excluding the word itself is necessary, and why it even makes sense at all, is that in a Gibbs sampling context, we repeatedly resample the values of each random variable, after having run through and sampled all previous variables. Hence the variable will already have a value, and we need to exclude this existing value from the various counts that we make use of.)

A combined example: LDA topic models

We now show how to combine some of the above scenarios to demonstrate how to Gibbs sample a real-world model, specifically a smoothed latent Dirichlet allocation (LDA) topic model.

The model is as follows:

\begin{array}{lcl}
\boldsymbol{\alpha} & \sim & \text{A Dirichlet hyperprior, either a constant or a random variable} \\
\boldsymbol{\beta} & \sim & \text{A Dirichlet hyperprior, either a constant or a random variable} \\
\boldsymbol{\theta}_{d=1 \dots M} & \sim & \operatorname{Dirichlet}_K(\boldsymbol{\alpha}) \\
\boldsymbol{\phi}_{k=1 \dots K} & \sim & \operatorname{Dirichlet}_V(\boldsymbol{\beta}) \\
z_{d=1 \dots M,\, n=1 \dots N_d} & \sim & \operatorname{Categorical}_K(\boldsymbol{\theta}_d) \\
w_{d=1 \dots M,\, n=1 \dots N_d} & \sim & \operatorname{Categorical}_V(\boldsymbol{\phi}_{z_{dn}})
\end{array}

Essentially we combine the previous three scenarios: we have categorical variables dependent on multiple priors sharing a hyperprior; we have categorical variables with dependent children (the latent-variable topic identities); and we have categorical variables with shifting membership in multiple priors sharing a hyperprior. In the standard LDA model, the words are completely observed, and hence we never need to resample them. (However, Gibbs sampling would equally be possible if only some or none of the words were observed. In such a case, we would want to initialize the distribution over the words in some reasonable fashion, e.g. from the output of some process that generates sentences, such as a machine translation model, in order for the resulting posterior latent variable distributions to make any sense.)

Using the above formulas, we can write down the conditional probabilities directly:

\begin{array}{lcl}
\Pr(w_{dn} = v \mid \mathbb{W}^{(-dn)}, \mathbb{Z}, \boldsymbol{\beta}) & \propto & \#\mathbb{W}_v^{k,(-dn)} + \beta_v \\
\Pr(z_{dn} = k \mid \mathbb{Z}^{(-dn)}, w_{dn} = v, \mathbb{W}^{(-dn)}, \boldsymbol{\alpha}) & \propto & (\#\mathbb{Z}_k^{d,(-dn)} + \alpha_k) \Pr(w_{dn} = v \mid \mathbb{W}^{(-dn)}, \mathbb{Z}, \boldsymbol{\beta})
\end{array}

Here we have defined the counts more explicitly to clearly separate counts of words and counts of topics:

\begin{array}{lcl}
\#\mathbb{W}_v^{k,(-dn)} & = & \text{number of words having value } v \text{ among topic } k \text{, excluding } w_{dn} \\
\#\mathbb{Z}_k^{d,(-dn)} & = & \text{number of topics having value } k \text{ among document } d \text{, excluding } z_{dn}
\end{array}

As in the scenario above with categorical variables with dependent children, the conditional probability of those dependent children appears in the definition of the parent's conditional probability. In this case, each latent variable has only a single dependent child word, so only one such term appears. (If there were multiple dependent children, all would have to appear in the parent's conditional probability, regardless of whether there was overlap between different parents and the same children, i.e. regardless of whether the dependent children of a given parent also have other parents. In a case where a child has multiple parents, the conditional probability for that child appears in the conditional probability definition of each of its parents.)

The definition above specifies only the unnormalized conditional probability of the words, while the topic conditional probability requires the actual (i.e. normalized) probability. Hence we have to normalize by summing over all word symbols:

\begin{array}{rcl}
\Pr(z_{dn} = k \mid \mathbb{Z}^{(-dn)}, w_{dn} = v, \mathbb{W}^{(-dn)}, \boldsymbol{\alpha}) & \propto & \bigl( \#\mathbb{Z}_k^{d,(-dn)} + \alpha_k \bigr) \dfrac{\#\mathbb{W}_v^{k,(-dn)} + \beta_v}{\sum_{v'=1}^{V} \bigl( \#\mathbb{W}_{v'}^{k,(-dn)} + \beta_{v'} \bigr)} \\
& = & \bigl( \#\mathbb{Z}_k^{d,(-dn)} + \alpha_k \bigr) \dfrac{\#\mathbb{W}_v^{k,(-dn)} + \beta_v}{\#\mathbb{W}^{k} + B - 1}
\end{array}

where

\begin{array}{lcl}
\#\mathbb{W}^{k} & = & \text{number of words generated by topic } k \\
B & = & \sum_{v=1}^{V} \beta_v
\end{array}

It's also worth making another point in detail, which concerns the second factor above in the conditional probability. Remember that the conditional distribution in general is derived from the joint distribution, and simplified by removing terms not dependent on the domain of the conditional (the part on the left side of the vertical bar). When a node z has dependent children, there will be one or more factors F(… | z) in the joint distribution that are dependent on z. Usually there is one factor for each dependent node, and it has the same density function as the distribution appearing in the mathematical definition. However, if a dependent node has another parent as well (a co-parent), and that co-parent is collapsed out, then the node will become dependent on all other nodes sharing that co-parent, and in place of multiple terms for each such node, the joint distribution will have only one joint term. We have exactly that situation here. Even though z_dn has only one child w_dn, that child has a Dirichlet co-parent that we have collapsed out, which induces a Dirichlet-multinomial over the entire set of nodes W^k.

It happens in this case that this issue does not cause major problems, precisely because of the one-to-one relationship between z_dn and w_dn. We can rewrite the joint distribution as follows:

\begin{array}{lcl}
p(\mathbb{W}^{k} \mid z_{dn}) & = & p(w_{dn} \mid \mathbb{W}^{k,(-dn)}, z_{dn}) \, p(\mathbb{W}^{k,(-dn)} \mid z_{dn}) \\
& = & p(w_{dn} \mid \mathbb{W}^{k,(-dn)}, z_{dn}) \, p(\mathbb{W}^{k,(-dn)}) \\
& \propto & p(w_{dn} \mid \mathbb{W}^{k,(-dn)}, z_{dn})
\end{array}

where in the set W^{k,(−dn)} (i.e. the set of nodes W^k excluding w_dn), none of the nodes have z_dn as a parent. Hence it can be eliminated as a conditioning factor (line 2), meaning that the entire factor can be eliminated from the conditional distribution (line 3).
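
Putting the two conditionals together gives the usual collapsed Gibbs update for smoothed LDA. A condensed sketch assuming NumPy, symmetric scalar hyperparameters, and documents given as lists of word ids (all function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def lda_collapsed_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200):
    """Collapsed Gibbs sampling for smoothed LDA; docs is a list of lists of word ids in 0..V-1."""
    ndk = np.zeros((len(docs), K))          # per-document topic counts   (#Z_k^d)
    nkv = np.zeros((K, V))                  # per-topic word counts       (#W_v^k)
    nk = np.zeros(K)                        # total words per topic       (#W^k)
    z = [rng.integers(0, K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):          # initialize counts from the random assignment
        for i, v in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkv[z[d][i], v] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k = z[d][i]                 # remove the current assignment from the counts
                ndk[d, k] -= 1; nkv[k, v] -= 1; nk[k] -= 1
                # p(z = k | ...) proportional to
                # (#Z_k^{d,(-dn)} + alpha) * (#W_v^{k,(-dn)} + beta) / (#W^{k,(-dn)} + V * beta)
                p = (ndk[d] + alpha) * (nkv[:, v] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                 # record the new assignment and restore the counts
                ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    return z, ndk, nkv

docs = [[0, 1, 2, 1], [3, 4, 3, 4, 4], [0, 2, 1, 0]]    # toy corpus with V = 5 word ids
print(lda_collapsed_gibbs(docs, K=2, V=5, iters=50)[1])
```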

A second example: Naive Bayes document clustering

Here is another model, with a different set of issues. This is an implementation of an unsupervised Naive Bayes model for document clustering. That is, we would like to classify documents into multiple categories (e.g. "spam" or "non-spam", or "scientific journal article", "newspaper article about finance", "newspaper article about politics", "love letter") based on textual content. However, we don't already know the correct category of any documents; instead, we want to cluster them based on mutual similarities. (For example, a set of scientific articles will tend to be similar to each other in word use but very different from a set of love letters.) This is a type of unsupervised learning. (The same technique can be used for doing semi-supervised learning, i.e. where we know the correct category of some fraction of the documents and would like to use this knowledge to help in clustering the remaining documents.)

The model is as follows:

\begin{array}{lcl}
\boldsymbol{\alpha} & \sim & \text{A Dirichlet hyperprior, either a constant or a random variable} \\
\boldsymbol{\beta} & \sim & \text{A Dirichlet hyperprior, either a constant or a random variable} \\
\boldsymbol{\theta}_{d=1 \dots M} & \sim & \operatorname{Dirichlet}_K(\boldsymbol{\alpha}) \\
\boldsymbol{\phi}_{k=1 \dots K} & \sim & \operatorname{Dirichlet}_V(\boldsymbol{\beta}) \\
z_{d=1 \dots M} & \sim & \operatorname{Categorical}_K(\boldsymbol{\theta}_d) \\
w_{d=1 \dots M,\, n=1 \dots N_d} & \sim & \operatorname{Categorical}_V(\boldsymbol{\phi}_{z_d})
\end{array}

In many ways, this model is very similar to the LDA topic model described above, but it assumes one topic per document rather than one topic per word (in LDA a document consists of a mixture of topics). This can be seen clearly in the above model, which is identical to the LDA model except that there is only one latent variable per document instead of one per word. Once again, we assume that we are collapsing all of the Dirichlet priors.

The conditional probability for a given word is almost identical to the LDA case. Once again, all words generated by the same Dirichlet prior are interdependent. In this case, this means the words of all documents having a given label — again, this can vary depending on the label assignments, but all we care about is the total counts. Hence:

\Pr(w_{dn} = v \mid \mathbb{W}^{(-dn)}, \mathbb{Z}, \boldsymbol{\beta}) \propto \#\mathbb{W}_v^{k,(-dn)} + \beta_v

where

\#\mathbb{W}_v^{k,(-dn)} = \text{number of words having value } v \text{ among documents with label } k \text{, excluding } w_{dn}

However, there is a critical difference in the conditional distribution of the latent variables for the label assignments, which is that a given label variable has multiple child nodes instead of just one, namely the nodes for all the words in the label's document. This relates closely to the discussion above about the factor F(… | z_d) that stems from the joint distribution. In this case, the joint distribution needs to be taken over all words in all documents containing a label assignment equal to the value of z_d, and has the value of a Dirichlet-multinomial distribution. Furthermore, we cannot reduce this joint distribution down to a conditional distribution over a single word. Rather, we can reduce it down only to a smaller joint conditional distribution over the words in the document for the label in question, and hence we cannot simplify it using the trick above that yields a simple sum of expected count and prior. Although it is in fact possible to rewrite it as a product of such individual sums, the number of factors is very large, and is not clearly more efficient than directly computing the Dirichlet-multinomial distribution probability.
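
A condensed sketch of one collapsed Gibbs update for a single document's label, assuming NumPy and SciPy, a symmetric β, and a vector-valued α (in this model each document has its own θ_d, so collapsing it contributes only the prior weight α_k); the Dirichlet-multinomial term over the whole document is evaluated with log-gamma functions. All names are illustrative:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

def resample_label(d, z, doc_counts, word_counts, alpha, beta):
    """One collapsed Gibbs update of the label z[d] in the Naive Bayes clustering model.
    doc_counts[d]  : length-V vector of word counts of document d
    word_counts[k] : length-V vector of word counts over all documents currently labelled k
    alpha          : length-K Dirichlet parameter over labels
    beta           : symmetric Dirichlet parameter over the vocabulary
    """
    K, V = word_counts.shape
    B = V * beta
    word_counts[z[d]] -= doc_counts[d]          # remove document d from its current label
    Nd = doc_counts[d].sum()
    logp = np.empty(K)
    for k in range(K):
        nk = word_counts[k].sum()
        # prior weight plus the log Dirichlet-multinomial term over the whole document
        logp[k] = (np.log(alpha[k])
                   + gammaln(nk + B) - gammaln(nk + Nd + B)
                   + np.sum(gammaln(word_counts[k] + doc_counts[d] + beta)
                            - gammaln(word_counts[k] + beta)))
    p = np.exp(logp - logp.max())
    z[d] = rng.choice(K, p=p / p.sum())
    word_counts[z[d]] += doc_counts[d]          # add document d back under its new label
    return z, word_counts
```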

Related distributions


The one-dimensional version of the Dirichlet-multinomial distribution is known as the beta-binomial distribution.

The Dirichlet-multinomial distribution has a relationship with the negative binomial distribution analogous to the relationship of the multinomial distribution with the Poisson distribution.[2]

Uses


The Dirichlet-multinomial distribution is used in automated document classification and clustering, genetics, economics, combat modeling, and quantitative marketing.



References


Citations

  1. Glüsenkamp, T. (2018). "Probabilistic treatment of the uncertainty from the finite size of weighted Monte Carlo data". EPJ Plus. 133 (6): 218. arXiv:1712.01293. Bibcode:2018EPJP..133..218G. doi:10.1140/epjp/i2018-12042-x. S2CID 125665629.
  2. Theorem 1 of Zhou, M. (2018). "Nonparametric Bayesian Negative Binomial Factor Analysis". Bayesian Analysis. 13 (4): 1065–1093. arXiv:1604.07464. doi:10.1214/17-BA1070.
