
Dirichlet process

Family of stochastic processes
Draws from the Dirichlet process $\operatorname{DP}(N(0,1),\alpha)$. The four rows use different $\alpha$ (top to bottom: 1, 10, 100 and 1000) and each row contains three repetitions of the same experiment. As seen from the graphs, draws from a Dirichlet process are discrete distributions and they become less concentrated (more spread out) with increasing $\alpha$. The graphs were generated using the stick-breaking process view of the Dirichlet process.

In probability theory, Dirichlet processes (after the distribution associated with Peter Gustav Lejeune Dirichlet) are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables: how likely it is that the random variables are distributed according to one or another particular distribution.

As an example, a bag of 100 real-world dice is a random probability mass function (random pmf): to sample this random pmf you put your hand in the bag and draw out a die, that is, you draw a pmf. A bag of dice manufactured using a crude process 100 years ago will likely have probabilities that deviate wildly from the uniform pmf, whereas a bag of state-of-the-art dice used by Las Vegas casinos may have barely perceptible imperfections. We can model the randomness of pmfs with the Dirichlet distribution.[1]

The Dirichlet process is specified by a base distribution $H$ and a positive real number $\alpha$ called the concentration parameter (also known as scaling parameter). The base distribution is the expected value of the process, i.e., the Dirichlet process draws distributions "around" the base distribution the way a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The scaling parameter specifies how strong this discretization is: in the limit of $\alpha \rightarrow 0$, the realizations are all concentrated at a single value, while in the limit of $\alpha \rightarrow \infty$ the realizations become continuous. Between the two extremes the realizations are discrete distributions with less and less concentration as $\alpha$ increases.

The Dirichlet process can also be seen as the infinite-dimensional generalization of the Dirichlet distribution. In the same way as the Dirichlet distribution is the conjugate prior for the categorical distribution, the Dirichlet process is the conjugate prior for infinite, nonparametric discrete distributions. A particularly important application of Dirichlet processes is as a prior probability distribution in infinite mixture models.

The Dirichlet process was formally introduced by Thomas S. Ferguson in 1973.[2] It has since been applied in data mining and machine learning, among others for natural language processing, computer vision and bioinformatics.

Introduction


Dirichlet processes are usually used when modelling data that tends to repeat previous values in a so-called "rich get richer" fashion. Specifically, suppose that the generation of values $X_1, X_2, \dots$ can be simulated by the following algorithm.

Input: $H$ (a probability distribution called base distribution), $\alpha$ (a positive real number called scaling parameter)
For $n \geq 1$:

a) With probability $\frac{\alpha}{\alpha + n - 1}$ draw $X_n$ from $H$.

b) With probability $\frac{n_x}{\alpha + n - 1}$ set $X_n = x$, where $n_x$ is the number of previous observations of $x$.
(Formally, $n_x := \bigl|\{ j : X_j = x \text{ and } j < n \}\bigr|$, where $|\cdot|$ denotes the number of elements in the set.)
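
A minimal Python sketch of this sampling rule, assuming for illustration a standard normal base distribution $H = N(0,1)$; the function and parameter names are illustrative rather than standard:

```python
import random

def draw_sequence(n, alpha, base_draw=lambda: random.gauss(0.0, 1.0)):
    """Simulate X_1, ..., X_n with the rule above; base_draw samples from H."""
    draws = []
    for i in range(n):                    # i previous observations exist at step n = i + 1
        if random.random() < alpha / (alpha + i):
            x = base_draw()               # case a): a fresh draw from H
        else:
            x = random.choice(draws)      # case b): repeat a past value; a value seen
                                          # n_x times is chosen with prob. n_x / (alpha + i)
        draws.append(x)
    return draws

print(draw_sequence(20, alpha=1.0))
```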

At the same time, another common model for data is that the observations $X_1, X_2, \dots$ are assumed to be independent and identically distributed (i.i.d.) according to some (random) distribution $P$. The goal of introducing Dirichlet processes is to be able to describe the procedure outlined above in this i.i.d. model.

The observations $X_1, X_2, \dots$ in the algorithm are not independent, since we have to consider the previous results when generating the next value. They are, however, exchangeable. This fact can be shown by calculating the joint probability distribution of the observations and noticing that the resulting formula only depends on which $x$ values occur among the observations and how many repetitions they each have. Because of this exchangeability, de Finetti's representation theorem applies and it implies that the observations $X_1, X_2, \dots$ are conditionally independent given a (latent) distribution $P$. This $P$ is a random variable itself and has a distribution. This distribution (over distributions) is called a Dirichlet process ($\operatorname{DP}$). In summary, this means that we get an equivalent procedure to the above algorithm:

  1. Draw a distribution $P$ from $\operatorname{DP}(H, \alpha)$.
  2. Draw observations $X_1, X_2, \dots$ independently from $P$.

In practice, however, drawing a concrete distribution $P$ is impossible, since its specification requires an infinite amount of information. This is a common phenomenon in the context of Bayesian non-parametric statistics, where a typical task is to learn distributions on function spaces, which involve effectively infinitely many parameters. The key insight is that in many applications the infinite-dimensional distributions appear only as an intermediary computational device and are not required for either the initial specification of prior beliefs or for the statement of the final inference.

Formal definition


Given a measurable set $S$, a base probability distribution $H$ and a positive real number $\alpha$, the Dirichlet process $\operatorname{DP}(H, \alpha)$ is a stochastic process whose sample path (or realization, i.e. an infinite sequence of random variates drawn from the process) is a probability distribution over $S$, such that the following holds. For any measurable finite partition of $S$, denoted $\{B_i\}_{i=1}^{n}$,

if $X \sim \operatorname{DP}(H, \alpha)$,
then $(X(B_1), \dots, X(B_n)) \sim \operatorname{Dir}(\alpha H(B_1), \dots, \alpha H(B_n)),$

where $\operatorname{Dir}$ denotes the Dirichlet distribution and the notation $X \sim D$ means that the random variable $X$ has the distribution $D$. For example, for the two-set partition $\{B, B^{c}\}$ this says that the random mass $X(B)$ assigned to $B$ follows a $\operatorname{Beta}(\alpha H(B), \alpha H(B^{c}))$ distribution.
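
The defining property can be illustrated numerically in a finite setting. The sketch below is a rough check rather than a proof: it takes $S = \{1, 2, 3\}$, samples $(X(\{1\}), X(\{2\}), X(\{3\})) \sim \operatorname{Dir}(\alpha H)$ as the definition requires, and verifies that merging cells of the partition again yields the Dirichlet (here beta) marginals the definition predicts. The NumPy-based code and the chosen values of $H$ and $\alpha$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative choices: a base measure H on S = {1, 2, 3} and concentration alpha.
H = np.array([0.2, 0.3, 0.5])
alpha = 4.0

# By the definition, for the partition {1}, {2}, {3} of S:
#   (X({1}), X({2}), X({3})) ~ Dir(alpha * H).
X = rng.dirichlet(alpha * H, size=100_000)

# Merging cells, e.g. B1 = {1} and B2 = {2, 3}, must again give a Dirichlet
# (here beta) law: (X(B1), X(B2)) ~ Dir(alpha * H(B1), alpha * H(B2)).
XB1 = X[:, 0]
print("mean of X(B1):", XB1.mean(), " theory:", H[0])
print("var  of X(B1):", XB1.var(),  " theory:", H[0] * (1 - H[0]) / (alpha + 1))
```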

Alternative views


There are several equivalent views of the Dirichlet process. Besides the formal definition above, the Dirichlet process can be defined implicitly through de Finetti's theorem as described in the first section; this is often called the Chinese restaurant process. A third alternative is the stick-breaking process, which defines the Dirichlet process constructively by writing a distribution sampled from the process as $f(x) = \sum_{k=1}^{\infty} \beta_k \delta_{x_k}(x)$, where $\{x_k\}_{k=1}^{\infty}$ are samples from the base distribution $H$, $\delta_{x_k}$ is an indicator function centered on $x_k$ (zero everywhere except for $\delta_{x_k}(x_k) = 1$) and the $\beta_k$ are defined by a recursive scheme that repeatedly samples from the beta distribution $\operatorname{Beta}(1, \alpha)$.

The Chinese restaurant process

Main article: Chinese restaurant process
Animation of a Chinese restaurant process with scaling parameter $\alpha = 0.5$. Tables are hidden once the customers of a table can no longer be displayed; however, every table has infinitely many seats. (Recording of an interactive animation.[3])

A widely employed metaphor for the Dirichlet process is based on the so-called Chinese restaurant process. The metaphor is as follows:

Imagine a Chinese restaurant into which customers enter. A new customer sits down at a table with a probability proportional to the number of customers already sitting there. Additionally, a customer opens a new table with a probability proportional to the scaling parameter $\alpha$. After infinitely many customers have entered, one obtains a probability distribution over infinitely many tables to be chosen. This probability distribution over the tables is a random sample of the probabilities of observations drawn from a Dirichlet process with scaling parameter $\alpha$.

If one associates draws from the base measure $H$ with every table, the resulting distribution over the sample space $S$ is a random sample of a Dirichlet process. The Chinese restaurant process is related to the Pólya urn sampling scheme which yields samples from finite Dirichlet distributions.

Because customers sit at a table with a probability proportional to the number of customers already sitting at the table, two properties of the DP can be deduced:

  1. The Dirichlet process exhibits a self-reinforcing property: The more often a given value has been sampled in the past, the more likely it is to be sampled again.
  2. Even if $H$ is a distribution over an uncountable set, there is a nonzero probability that two samples will have exactly the same value because the probability mass will concentrate on a small number of tables.
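
A short simulation of the seating process makes this "rich get richer" behaviour visible: a few tables end up holding most of the customers. The Python sketch below is an illustrative rendering of the metaphor, not a canonical implementation.

```python
import random

def chinese_restaurant(n_customers, alpha):
    """Seat n_customers one by one and return the resulting table sizes."""
    tables = []                                  # tables[k] = customers at table k
    for n in range(n_customers):
        if random.random() < alpha / (alpha + n):
            tables.append(1)                     # open a new table: prob. alpha / (alpha + n)
        else:
            # join table k with probability tables[k] / (alpha + n)
            k = random.choices(range(len(tables)), weights=tables)[0]
            tables[k] += 1
    return sorted(tables, reverse=True)

print(chinese_restaurant(1000, alpha=0.5))       # typically a handful of large tables
```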

The stick-breaking process


A third approach to the Dirichlet process is the so-called stick-breaking process view. Conceptually, this involves repeatedly breaking off and discarding a random fraction (sampled from a beta distribution) of a "stick" that is initially of length 1. Remember that draws from a Dirichlet process are distributions over a set $S$. As noted previously, the distribution drawn is discrete with probability 1. In the stick-breaking process view, we explicitly use the discreteness and give the probability mass function of this (random) discrete distribution as:

$f(\theta) = \sum_{k=1}^{\infty} \beta_k \cdot \delta_{\theta_k}(\theta)$

where $\delta_{\theta_k}$ is the indicator function which evaluates to zero everywhere, except for $\delta_{\theta_k}(\theta_k) = 1$. Since this distribution is random itself, its mass function is parameterized by two sets of random variables: the locations $\{\theta_k\}_{k=1}^{\infty}$ and the corresponding probabilities $\{\beta_k\}_{k=1}^{\infty}$. In the following, we present without proof what these random variables are.

The locations $\theta_k$ are independent and identically distributed according to $H$, the base distribution of the Dirichlet process. The probabilities $\beta_k$ are given by a procedure resembling the breaking of a unit-length stick (hence the name):

$\beta_k = \beta'_k \cdot \prod_{i=1}^{k-1} \left(1 - \beta'_i\right)$

where $\beta'_k$ are independent random variables with the beta distribution $\operatorname{Beta}(1, \alpha)$. The resemblance to 'stick-breaking' can be seen by considering $\beta_k$ as the length of a piece of a stick. We start with a unit-length stick and in each step we break off a portion of the remaining stick according to $\beta'_k$ and assign this broken-off piece to $\beta_k$. The formula can be understood by noting that after the first $k-1$ values have their portions assigned, the length of the remainder of the stick is $\prod_{i=1}^{k-1}\left(1 - \beta'_i\right)$, and this piece is broken according to $\beta'_k$ and gets assigned to $\beta_k$.

The smaller $\alpha$ is, the less of the stick will be left for subsequent values (on average), yielding more concentrated distributions.

The stick-breaking process is similar to the construction where one samples sequentially from marginal beta distributions in order to generate a sample from a Dirichlet distribution.[4]
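
As an illustration of the concentration behaviour described above, the following sketch generates truncated stick-breaking weights and counts how many atoms are needed to cover 99% of the mass for a few values of $\alpha$; the truncation level and the NumPy-based style are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, trunc=10_000):
    """Truncated stick-breaking weights beta_1, ..., beta_trunc."""
    b = rng.beta(1.0, alpha, size=trunc)                       # beta'_k ~ Beta(1, alpha)
    left = np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))   # stick remaining before break k
    return b * left                                            # beta_k = beta'_k * prod_{i<k}(1 - beta'_i)

for alpha in (1.0, 10.0, 100.0):
    w = np.sort(stick_breaking(alpha))[::-1]
    atoms_99 = np.searchsorted(np.cumsum(w), 0.99) + 1
    print(f"alpha = {alpha:6.1f}: {atoms_99} atoms carry 99% of the total mass")
```

A draw from the Dirichlet process is then obtained by attaching these weights to atoms $\theta_k$ sampled independently from $H$.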

The Pólya urn scheme


Yet another way to visualize the Dirichlet process and Chinese restaurant process is as a modified Pólya urn scheme, sometimes called the Blackwell–MacQueen sampling scheme. Imagine that we start with an urn filled with $\alpha$ black balls. Then we proceed as follows:

  1. Each time we need an observation, we draw a ball from the urn.
  2. If the ball is black, we generate a new (non-black) colour uniformly, label a new ball this colour, drop the new ball into the urn along with the ball we drew, and return the colour we generated.
  3. Otherwise, label a new ball with the colour of the ball we drew, drop the new ball into the urn along with the ball we drew, and return the colour we observed.

The resulting distribution over colours is the same as the distribution over tables in the Chinese restaurant process. Furthermore, when we draw a black ball, if rather than generating a new colour, we instead pick a random value from a base distribution $H$ and use that value to label the new ball, the resulting distribution over labels will be the same as the distribution over the values in a Dirichlet process.

Use as a prior distribution


The Dirichlet process can be used as a prior distribution to estimate the probability distribution that generates the data. In this section, we consider the model

$P \sim \operatorname{DP}(H, \alpha),$
$X_1, \ldots, X_n \mid P \;\overset{\text{i.i.d.}}{\sim}\; P.$

The Dirichlet process prior satisfies prior conjugacy, posterior consistency, and a Bernstein–von Mises theorem.[5]

Prior conjugacy


In this model, the posterior distribution is again a Dirichlet process. This means that the Dirichlet process is a conjugate prior for this model. The posterior distribution is given by

$P \mid X_1, \ldots, X_n \sim \operatorname{DP}\left(\frac{\alpha}{\alpha+n} H + \frac{1}{\alpha+n} \sum_{i=1}^{n} \delta_{X_i},\; \alpha + n\right) = \operatorname{DP}\left(\frac{\alpha}{\alpha+n} H + \frac{n}{\alpha+n} \mathbb{P}_n,\; \alpha + n\right)$

where $\mathbb{P}_n$ is defined below.
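
Since the base measure of the posterior Dirichlet process is the mixture $\frac{\alpha}{\alpha+n} H + \frac{n}{\alpha+n}\mathbb{P}_n$, sampling from it amounts to flipping a biased coin between a fresh draw from $H$ and a uniformly chosen past observation; this mixture is also the posterior predictive distribution of the next observation. A minimal sketch, with illustrative function names and an assumed normal base distribution:

```python
import random

def posterior_base_draw(data, alpha, base_draw):
    """One draw from the base measure of the posterior Dirichlet process,
    (alpha * H + sum_i delta_{X_i}) / (alpha + n); equivalently, one draw from
    the posterior predictive distribution of the next observation."""
    n = len(data)
    if random.random() < alpha / (alpha + n):
        return base_draw()          # with probability alpha / (alpha + n): fresh draw from H
    return random.choice(data)      # otherwise: a past observation, each X_i
                                    # chosen with probability 1 / (alpha + n)

data = [0.3, 0.3, 1.7, -0.2]        # toy observations
draws = [posterior_base_draw(data, alpha=1.0,
                             base_draw=lambda: random.gauss(0.0, 1.0))
         for _ in range(5)]
print(draws)
```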

Posterior consistency


If we take the frequentist view of probability, we believe there is a true probability distribution $P_0$ that generated the data. Then it turns out that the Dirichlet process is consistent in the weak topology, which means that for every weak neighbourhood $U$ of $P_0$, the posterior probability of $U$ converges to 1.

Bernstein–von Mises theorem


In order to interpret the credible sets as confidence sets, a Bernstein–von Mises theorem is needed. In the case of the Dirichlet process we compare the posterior distribution with the empirical process $\mathbb{P}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$. Suppose $\mathcal{F}$ is a $P_0$-Donsker class, i.e.

$\sqrt{n}\left(\mathbb{P}_n - P_0\right) \rightsquigarrow G_{P_0}$

for some Brownian bridge $G_{P_0}$. Suppose also that there exists a function $F$ with $F(x) \geq \sup_{f \in \mathcal{F}} f(x)$ and $\int F^2 \, \mathrm{d}H < \infty$. Then, $P_0$-almost surely,

$\sqrt{n}\left(P - \mathbb{P}_n\right) \mid X_1, \ldots, X_n \rightsquigarrow G_{P_0}.$

This implies that the credible sets constructed from the posterior are asymptotic confidence sets, and the Bayesian inference based on the Dirichlet process is asymptotically also valid frequentist inference.

Use in Dirichlet mixture models

Simulation of 1000 observations drawn from a Dirichlet mixture model. Each observation within a cluster is drawn independently from the multivariate normal distribution $N(\mu_k, 1/4)$. The cluster means $\mu_k$ are drawn from a distribution $G$ which itself is drawn from a Dirichlet process with concentration parameter $\alpha = 0.5$ and base distribution $H = N(2, 16)$. Each row is a new simulation.

To understand what Dirichlet processes are and the problem they solve we consider the example of data clustering. It is a common situation that data points are assumed to be distributed in a hierarchical fashion where each data point belongs to a (randomly chosen) cluster and the members of a cluster are further distributed randomly within that cluster.

Example 1


For example, we might be interested in how people will vote on a number of questions in an upcoming election. A reasonable model for this situation might be to classify each voter as a liberal, a conservative or a moderate and then model the event that a voter says "Yes" to any particular question as a Bernoulli random variable with the probability dependent on which political cluster they belong to. By looking at how votes were cast in previous years on similar pieces of legislation one could fit a predictive model using a simple clustering algorithm such as k-means. That algorithm, however, requires knowing in advance the number of clusters that generated the data. In many situations, it is not possible to determine this ahead of time, and even when we can reasonably assume a number of clusters we would still like to be able to check this assumption. For example, in the voting example above the division into liberal, conservative and moderate might not be finely tuned enough; attributes such as religion, class or race could also be critical for modelling voter behaviour, resulting in more clusters in the model.

Example 2


As another example, we might be interested in modelling the velocities of galaxies using a simple model assuming that the velocities are clustered, for instance by assuming each velocity is distributed according to the normal distribution $v_i \sim N(\mu_k, \sigma^2)$, where the $i$th observation belongs to the $k$th cluster of galaxies with common expected velocity. In this case it is far from obvious how to determine a priori how many clusters (of common velocities) there should be and any model for this would be highly suspect and should be checked against the data. By using a Dirichlet process prior for the distribution of cluster means we circumvent the need to explicitly specify ahead of time how many clusters there are, although the concentration parameter still controls it implicitly.

We consider this example in more detail. A first naive model is to presuppose that there are $K$ clusters of normally distributed velocities with common known fixed variance $\sigma^2$. Denoting the event that the $i$th observation is in the $k$th cluster as $z_i = k$ we can write this model as:

$\begin{aligned} (v_i \mid z_i = k, \mu_k) &\sim N(\mu_k, \sigma^2) \\ \operatorname{P}(z_i = k) &= \pi_k \\ (\boldsymbol{\pi} \mid \alpha) &\sim \operatorname{Dir}\left(\tfrac{\alpha}{K} \cdot \mathbf{1}_K\right) \\ \mu_k &\sim H(\lambda) \end{aligned}$

That is, we assume that the data belongs to $K$ distinct clusters with means $\mu_k$ and that $\pi_k$ is the (unknown) prior probability of a data point belonging to the $k$th cluster. We assume that we have no initial information distinguishing the clusters, which is captured by the symmetric prior $\operatorname{Dir}\left(\alpha/K \cdot \mathbf{1}_K\right)$. Here $\operatorname{Dir}$ denotes the Dirichlet distribution and $\mathbf{1}_K$ denotes a vector of length $K$ where each element is 1. We further assign independent and identical prior distributions $H(\lambda)$ to each of the cluster means, where $H$ may be any parametric distribution with parameters denoted as $\lambda$. The hyper-parameters $\alpha$ and $\lambda$ are taken to be known fixed constants, chosen to reflect our prior beliefs about the system. To understand the connection to Dirichlet process priors we rewrite this model in an equivalent but more suggestive form:

$\begin{aligned} (v_i \mid \tilde{\mu}_i) &\sim N(\tilde{\mu}_i, \sigma^2) \\ \tilde{\mu}_i &\sim G = \sum_{k=1}^{K} \pi_k \delta_{\mu_k}(\tilde{\mu}_i) \\ (\boldsymbol{\pi} \mid \alpha) &\sim \operatorname{Dir}\left(\tfrac{\alpha}{K} \cdot \mathbf{1}_K\right) \\ \mu_k &\sim H(\lambda) \end{aligned}$

Instead of imagining that each data point is first assigned a cluster and then drawn from the distribution associated to that cluster we now think of each observation being associated with a parameter $\tilde{\mu}_i$ drawn from some discrete distribution $G$ with support on the $K$ means. That is, we are now treating the $\tilde{\mu}_i$ as being drawn from the random distribution $G$ and our prior information is incorporated into the model by the distribution over distributions $G$.

Animation of the clustering process for one-dimensional data using Gaussian distributions drawn from a Dirichlet process. The histograms of the clusters are shown in different colours. During the parameter estimation process, new clusters are created and grow on the data. The legend shows the cluster colours and the number of datapoints assigned to each cluster.

We would now like to extend this model to work without pre-specifying a fixed number of clusters $K$. Mathematically, this means we would like to select a random prior distribution $G(\tilde{\mu}_i) = \sum_{k=1}^{\infty} \pi_k \delta_{\mu_k}(\tilde{\mu}_i)$ where the values of the cluster means $\mu_k$ are again independently distributed according to $H(\lambda)$ and the distribution over $\pi_k$ is symmetric over the infinite set of clusters. This is exactly what is accomplished by the model:

$\begin{aligned} (v_i \mid \tilde{\mu}_i) &\sim N(\tilde{\mu}_i, \sigma^2) \\ \tilde{\mu}_i &\sim G \\ G &\sim \operatorname{DP}(H(\lambda), \alpha) \end{aligned}$

With this in hand we can better understand the computational merits of the Dirichlet process. Suppose that we wanted to draw $n$ observations from the naive model with exactly $K$ clusters. A simple algorithm for doing this would be to draw $K$ values of $\mu_k$ from $H(\lambda)$, a distribution $\boldsymbol{\pi}$ from $\operatorname{Dir}\left(\alpha/K \cdot \mathbf{1}_K\right)$, and then for each observation independently sample the cluster $k$ with probability $\pi_k$ and the value of the observation according to $N\left(\mu_k, \sigma^2\right)$. It is easy to see that this algorithm does not work in the case where we allow infinitely many clusters, because this would require sampling an infinite-dimensional parameter $\boldsymbol{\pi}$. However, it is still possible to sample observations $v_i$: one can, for example, use the Chinese restaurant representation described above and calculate the probability for used clusters and for a new cluster to be created. This avoids having to explicitly specify $\boldsymbol{\pi}$. Other solutions are based on a truncation of clusters: a (high) upper bound on the true number of clusters is introduced, and cluster numbers higher than this bound are treated as one cluster.
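
A sketch of this sampling strategy: observations are generated one at a time via the Chinese restaurant representation, with each point either joining an existing cluster (with probability proportional to its size) or opening a new cluster whose mean is drawn from $H$, so the infinite vector $\boldsymbol{\pi}$ is never instantiated. The parameter values mirror the simulation figure above ($\alpha = 0.5$, $H = N(2, 16)$, component variance $1/4$), but the code itself is an illustrative assumption.

```python
import random

def sample_dp_mixture(n, alpha, sigma, base_draw):
    """Draw n observations from the Dirichlet process mixture via the
    Chinese restaurant representation; the infinite vector pi is never built."""
    means, counts, obs = [], [], []
    for i in range(n):
        if random.random() < alpha / (alpha + i):
            means.append(base_draw())      # open a new cluster with mean mu_k ~ H(lambda)
            counts.append(0)
            k = len(means) - 1
        else:
            # join an existing cluster k with probability counts[k] / (alpha + i)
            k = random.choices(range(len(means)), weights=counts)[0]
        counts[k] += 1
        obs.append(random.gauss(means[k], sigma))   # v_i ~ N(mu_k, sigma^2)
    return obs, counts

# parameters chosen to mirror the simulation figure above
obs, counts = sample_dp_mixture(1000, alpha=0.5, sigma=0.5,
                                base_draw=lambda: random.gauss(2.0, 4.0))  # H = N(2, 16)
print("cluster sizes:", sorted(counts, reverse=True))
```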

Fitting the model described above based on observed data $D$ means finding the posterior distribution $p\left(\boldsymbol{\pi}, \boldsymbol{\mu} \mid D\right)$ over cluster probabilities and their associated means. In the infinite-dimensional case it is obviously impossible to write down the posterior explicitly. It is, however, possible to draw samples from this posterior using a modified Gibbs sampler.[6] This is the critical fact that makes the Dirichlet process prior useful for inference.
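
A compact sketch of one collapsed Gibbs sweep for this kind of model, assuming for concreteness a conjugate normal base distribution $H = N(\mu_0, \tau_0^2)$ and a known component variance $\sigma^2$, so that the cluster means can be integrated out analytically. This is a generic textbook-style scheme, not necessarily the specific sampler of the cited reference, and the variable names are illustrative.

```python
import math
import random

def log_predictive(v, pts, sigma2, mu0, tau02):
    """log-density of v under the posterior predictive of a cluster containing
    the points pts, with H = N(mu0, tau0^2) and known component variance sigma2."""
    prec = 1.0 / tau02 + len(pts) / sigma2
    mean = (mu0 / tau02 + sum(pts) / sigma2) / prec
    var = 1.0 / prec + sigma2
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)

def gibbs_sweep(v, z, alpha, sigma2, mu0, tau02):
    """One collapsed Gibbs sweep over the cluster labels z (means integrated out)."""
    for i, vi in enumerate(v):
        others = [j for j in range(len(v)) if j != i]
        labels = sorted({z[j] for j in others})
        logw = []
        for k in labels:
            pts = [v[j] for j in others if z[j] == k]
            # existing cluster k: weight n_{-i,k} * p(v_i | points already in k)
            logw.append(math.log(len(pts)) + log_predictive(vi, pts, sigma2, mu0, tau02))
        # brand-new cluster: weight alpha * p(v_i | empty cluster)
        logw.append(math.log(alpha) + log_predictive(vi, [], sigma2, mu0, tau02))
        m = max(logw)
        pick = random.choices(range(len(logw)),
                              weights=[math.exp(x - m) for x in logw])[0]
        z[i] = labels[pick] if pick < len(labels) else max(z) + 1
    return z

# toy run: two well-separated groups of "velocities"
v = [random.gauss(0.0, 0.5) for _ in range(30)] + [random.gauss(5.0, 0.5) for _ in range(30)]
z = [0] * len(v)
for _ in range(50):
    z = gibbs_sweep(v, z, alpha=1.0, sigma2=0.25, mu0=0.0, tau02=25.0)
print("number of occupied clusters:", len(set(z)))
```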

Applications of the Dirichlet process


Dirichlet processes are frequently used in Bayesian nonparametric statistics. "Nonparametric" here does not mean a parameter-less model, but rather a model in which representations grow as more data are observed. Bayesian nonparametric models have gained considerable popularity in the field of machine learning because of the above-mentioned flexibility, especially in unsupervised learning. In a Bayesian nonparametric model, the prior and posterior distributions are not parametric distributions, but stochastic processes.[7] The fact that the Dirichlet distribution is a probability distribution on the simplex of vectors of non-negative numbers that sum to one makes it a good candidate to model distributions over distributions or distributions over functions. Additionally, the nonparametric nature of this model makes it an ideal candidate for clustering problems where the distinct number of clusters is unknown beforehand. The Dirichlet process has also been used for developing mixture-of-experts models in the context of supervised learning algorithms (regression or classification settings), for instance mixtures of Gaussian process experts, where the number of required experts must be inferred from the data.[8][9]

As draws from a Dirichlet process are discrete, an important use is as a prior probability in infinite mixture models. In this case, $S$ is the parametric set of component distributions. The generative process is therefore that a sample is drawn from a Dirichlet process, and for each data point, in turn, a value is drawn from this sample distribution and used as the component distribution for that data point. The fact that there is no limit to the number of distinct components which may be generated makes this kind of model appropriate for the case when the number of mixture components is not well-defined in advance. Examples include the infinite mixture of Gaussians model[10] as well as associated mixture regression models.[11]

The infinite nature of these models also lends them to natural language processing applications, where it is often desirable to treat the vocabulary as an infinite, discrete set.

The Dirichlet process can also be used for nonparametric hypothesis testing, i.e. to develop Bayesian nonparametric versions of the classical nonparametric hypothesis tests, e.g. the sign test, the Wilcoxon rank-sum test, the Wilcoxon signed-rank test, etc. For instance, Bayesian nonparametric versions of the Wilcoxon rank-sum test and the Wilcoxon signed-rank test have been developed by using the imprecise Dirichlet process, a prior-ignorance Dirichlet process.[citation needed]


References

  1. ^ Frigyik, Bela A.; Kapila, Amol; Gupta, Maya R. "Introduction to the Dirichlet Distribution and Related Processes" (PDF). Retrieved 2 September 2021.
  2. ^ Ferguson, Thomas (1973). "Bayesian analysis of some nonparametric problems". Annals of Statistics. 1 (2): 209–230. doi:10.1214/aos/1176342360. MR 0350949.
  3. ^ "Dirichlet Process and Dirichlet Distribution – Polya Restaurant Scheme and Chinese Restaurant Process".
  4. ^ For the proof, see Paisley, John (August 2010). "A simple proof of the stick-breaking construction of the Dirichlet Process" (PDF). Columbia University. Archived from the original (PDF) on January 22, 2015.
  5. ^ Aad van der Vaart; Subhashis Ghosal (2017). Fundamentals of Bayesian Nonparametric Inference. Cambridge University Press. ISBN 978-0-521-87826-5.
  6. ^ Sudderth, Erik (2006). Graphical Models for Visual Object Recognition and Tracking (PDF) (Ph.D. thesis). MIT.
  7. ^ Nils Lid Hjort; Chris Holmes; Peter Müller; Stephen G. Walker (2010). Bayesian Nonparametrics. Cambridge University Press. ISBN 978-0-521-51346-3.
  8. ^ Sotirios P. Chatzis, "A Latent Variable Gaussian Process Model with Pitman–Yor Process Priors for Multiclass Classification," Neurocomputing, vol. 120, pp. 482–489, Nov. 2013. doi:10.1016/j.neucom.2013.04.029
  9. ^ Sotirios P. Chatzis; Yiannis Demiris, "Nonparametric mixtures of Gaussian processes with power-law behaviour," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 12, pp. 1862–1871, Dec. 2012. doi:10.1109/TNNLS.2012.2217986
  10. ^ Rasmussen, Carl (2000). "The Infinite Gaussian Mixture Model" (PDF). Advances in Neural Information Processing Systems. 12: 554–560.
  11. ^ Sotirios P. Chatzis; Dimitrios Korkinof; Yiannis Demiris, "A nonparametric Bayesian approach toward robot learning by demonstration," Robotics and Autonomous Systems, vol. 60, no. 6, pp. 789–802, June 2012. doi:10.1016/j.robot.2012.02.005
