Movatterモバイル変換

[0]ホーム

Jump to content

Mixture model

Edit links

From Wikipedia, the free encyclopedia

(Redirected fromMixture models)

Statistical concept

Not to be confused withmixed model.

Structure

[edit]

General mixture model

[edit]

A typical finite-dimensional mixture model is ahierarchical model consisting of the following components:

N random variables that are observed, each distributed according to a mixture ofK components, with the components belonging to the sameparametric family of distributions (e.g., allnormal, allZipfian, etc.) but with different parameters. However, it is also possible to have a finite mixture model where each component belongs to a differentparametric family of distributions,^[1] for example, a mixture of amultivariate normal distribution and ageneralized hyperbolic distribution.

N randomlatent variables specifying the identity of the mixture component of each observation, each distributed according to aK-dimensionalcategorical distribution
A set ofK mixture weights, which are probabilities that sum to 1.
A set ofK parameters, each specifying the parameter of the corresponding mixture component. In many cases, each "parameter" is actually a set of parameters. For example, if the mixture components areGaussian distributions, there will be amean andvariance for each component. If the mixture components arecategorical distributions (e.g., when each observation is a token from a finite alphabet of sizeV), there will be a vector ofV probabilities summing to 1.

In addition, in aBayesian setting, the mixture weights and parameters will themselves be random variables, andprior distributions will be placed over the variables. In such a case, the weights are typically viewed as aK-dimensional random vector drawn from aDirichlet distribution (theconjugate prior of the categorical distribution), and the parameters will be distributed according to their respective conjugate priors.

Mathematically, a basic parametric mixture model can be described as follows:

{\begin{array}{lcl}K&=&{\text{number of mixture components}}\\N&=&{\text{number of observations}}\\\theta _{i=1\dots K}&=&{\text{parameter of distribution of observation associated with component }}i\\\phi _{i=1\dots K}&=&{\text{mixture weight, i.e., prior probability of a particular component }}i\\{\boldsymbol {\phi }}&=&K{\text{-dimensional vector composed of all the individual }}\phi _{1\dots K}{\text{; must sum to 1}}\\z_{i=1\dots N}&=&{\text{component of observation }}i\\x_{i=1\dots N}&=&{\text{observation }}i\\F(x|\theta )&=&{\text{probability distribution of an observation, parametrized on }}\theta \\z_{i=1\dots N}&\sim &\operatorname {Categorical} ({\boldsymbol {\phi }})\\x_{i=1\dots N}|z_{i=1\dots N}&\sim &F(\theta _{z_{i}})\end{array}}

In a Bayesian setting, all parameters are associated with random variables, as follows:

{\begin{array}{lcl}K,N&=&{\text{as above}}\\\theta _{i=1\dots K},\phi _{i=1\dots K},{\boldsymbol {\phi }}&=&{\text{as above}}\\z_{i=1\dots N},x_{i=1\dots N},F(x|\theta )&=&{\text{as above}}\\\alpha &=&{\text{shared hyperparameter for component parameters}}\\\beta &=&{\text{shared hyperparameter for mixture weights}}\\H(\theta |\alpha )&=&{\text{prior probability distribution of component parameters, parametrized on }}\alpha \\\theta _{i=1\dots K}&\sim &H(\theta |\alpha )\\{\boldsymbol {\phi }}&\sim &\operatorname {Symmetric-Dirichlet} _{K}(\beta )\\z_{i=1\dots N}|{\boldsymbol {\phi }}&\sim &\operatorname {Categorical} ({\boldsymbol {\phi }})\\x_{i=1\dots N}|z_{i=1\dots N},\theta _{i=1\dots K}&\sim &F(\theta _{z_{i}})\end{array}}

This characterization usesF andH to describe arbitrary distributions over observations and parameters, respectively. TypicallyH will be theconjugate prior ofF. The two most common choices ofF areGaussian aka "normal" (for real-valued observations) andcategorical (for discrete observations). Other common possibilities for the distribution of the mixture components are:

Binomial distribution, for the number of "positive occurrences" (e.g., successes, yes votes, etc.) given a fixed number of total occurrences
Multinomial distribution, similar to the binomial distribution, but for counts of multi-way occurrences (e.g., yes/no/maybe in a survey)
Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
Poisson distribution, for the number of occurrences of an event in a given period of time, for an event that is characterized by a fixed rate of occurrence
Exponential distribution, for the time before the next event occurs, for an event that is characterized by a fixed rate of occurrence
Log-normal distribution, for positive real numbers that are assumed to grow exponentially, such as incomes or prices
Multivariate normal distribution (aka multivariate Gaussian distribution), for vectors of correlated outcomes that are individually Gaussian-distributed
Multivariate Student'st-distribution, for vectors of heavy-tailed correlated outcomes^[2]
A vector ofBernoulli-distributed values, corresponding, e.g., to a black-and-white image, with each value representing a pixel; see the handwriting-recognition example below

Specific examples

[edit]

Gaussian mixture model

[edit]

A typical non-BayesianGaussian mixture model looks like this:

{\begin{array}{lcl}K,N&=&{\text{as above}}\\\phi _{i=1\dots K},{\boldsymbol {\phi }}&=&{\text{as above}}\\z_{i=1\dots N},x_{i=1\dots N}&=&{\text{as above}}\\\theta _{i=1\dots K}&=&\{\mu _{i=1\dots K},\sigma _{i=1\dots K}^{2}\}\\\mu _{i=1\dots K}&=&{\text{mean of component }}i\\\sigma _{i=1\dots K}^{2}&=&{\text{variance of component }}i\\z_{i=1\dots N}&\sim &\operatorname {Categorical} ({\boldsymbol {\phi }})\\x_{i=1\dots N}&\sim &{\mathcal {N}}(\mu _{z_{i}},\sigma _{z_{i}}^{2})\end{array}}

Bayesian Gaussian mixture model usingplate notation. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of sizeK.

A Bayesian version of aGaussian mixture model is as follows:

{\begin{array}{lcl}K,N&=&{\text{as above}}\\\phi _{i=1\dots K},{\boldsymbol {\phi }}&=&{\text{as above}}\\z_{i=1\dots N},x_{i=1\dots N}&=&{\text{as above}}\\\theta _{i=1\dots K}&=&\{\mu _{i=1\dots K},\sigma _{i=1\dots K}^{2}\}\\\mu _{i=1\dots K}&=&{\text{mean of component }}i\\\sigma _{i=1\dots K}^{2}&=&{\text{variance of component }}i\\\mu _{0},\lambda ,\nu ,\sigma _{0}^{2}&=&{\text{shared hyperparameters}}\\\mu _{i=1\dots K}&\sim &{\mathcal {N}}(\mu _{0},\lambda \sigma _{i}^{2})\\\sigma _{i=1\dots K}^{2}&\sim &\operatorname {Inverse-Gamma} (\nu ,\sigma _{0}^{2})\\{\boldsymbol {\phi }}&\sim &\operatorname {Symmetric-Dirichlet} _{K}(\beta )\\z_{i=1\dots N}&\sim &\operatorname {Categorical} ({\boldsymbol {\phi }})\\x_{i=1\dots N}&\sim &{\mathcal {N}}(\mu _{z_{i}},\sigma _{z_{i}}^{2})\end{array}}

{\displaystyle }

Animation of the clustering process for one-dimensional data using a Bayesian Gaussian mixture model where normal distributions are drawn from aDirichlet process. The histograms of the clusters are shown in different colours. During the parameter estimation process, new clusters are created and grow on the data. The legend shows the cluster colours and the number of datapoints assigned to each cluster.

Multivariate Gaussian mixture model

[edit]

A Bayesian Gaussian mixture model is commonly extended to fit a vector of unknown parameters (denoted in bold), or multivariate normal distributions. In a multivariate distribution (i.e. one modelling a vector ${\boldsymbol {x}}$ withN random variables) one may model a vector of parameters (such as several observations of a signal or patches within an image) using a Gaussian mixture model prior distribution on the vector of estimates given by $p({\boldsymbol {\theta }})=\sum _{i=1}^{K}\phi _{i}{\mathcal {N}}({\boldsymbol {\mu }}_{i},{\boldsymbol {\Sigma }}_{i})$ where thei^th vector component is characterized by normal distributions with weights $\phi _{i}$ , means ${\boldsymbol {\mu }}_{i}$ and covariance matrices ${\boldsymbol {\Sigma }}_{i}$ . To incorporate this prior into a Bayesian estimation, the prior is multiplied with the known distribution $p({\boldsymbol {x|\theta }})$ of the data ${\boldsymbol {x}}$ conditioned on the parameters ${\boldsymbol {\theta }}$ to be estimated. With this formulation, theposterior distribution $p({\boldsymbol {\theta |x}})$ isalso a Gaussian mixture model of the form $p({\boldsymbol {\theta |x}})=\sum _{i=1}^{K}{\tilde {\phi }}_{i}{\mathcal {N}}({\boldsymbol {{\tilde {\mu }}_{i}}},{\boldsymbol {\tilde {\Sigma }}}_{i})$ with new parameters ${\tilde {\phi }}_{i},{\boldsymbol {\tilde {\mu }}}_{i}$ and ${\boldsymbol {\tilde {\Sigma }}}_{i}$ that are updated using theEM algorithm.^[3] Although EM-based parameter updates are well-established, providing the initial estimates for these parameters is currently an area of active research. Note that this formulation yields a closed-form solution to the complete posterior distribution. Estimations of the random variable ${\boldsymbol {\theta }}$ may be obtained via one of several estimators, such as the mean or maximum of the posterior distribution.

Such distributions are useful for assuming patch-wise shapes of images and clusters, for example. In the case of image representation, each Gaussian may be tilted, expanded, and warped according to the covariance matrices ${\boldsymbol {\Sigma }}_{i}$ . One Gaussian distribution of the set is fit to each patch (usually of size 8×8 pixels) in the image. Notably, any distribution of points around a cluster (seek-means) may be accurately given enough Gaussian components, but scarcely overK=20 components are needed to accurately model a given image distribution or cluster of data.

Categorical mixture model

[edit]

A typical non-Bayesian mixture model withcategorical observations looks like this:

$K, N : {\displaystyle K,N:}$ as above
$\phi _{i=1\dots K},{\boldsymbol {\phi }}:$ as above
$z_{i=1\dots N},x_{i=1\dots N}:$ as above
$V : {\displaystyle V:}$ dimension of categorical observations, e.g., size of word vocabulary
$\theta _{i=1\dots K,j=1\dots V}:$ probability for component $i {\displaystyle i}$ of observing item $j {\displaystyle j}$
${\boldsymbol {\theta }}_{i=1\dots K}:$ vector of dimension $V, {\displaystyle V,}$ composed of $\theta _{i,1\dots V};$ must sum to 1

The random variables:

{\begin{array}{lcl}z_{i=1\dots N}&\sim &\operatorname {Categorical} ({\boldsymbol {\phi }})\\x_{i=1\dots N}&\sim &{\text{Categorical}}({\boldsymbol {\theta }}_{z_{i}})\end{array}}

Bayesian categorical mixture model usingplate notation. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of sizeK; likewise for [V].

A typical Bayesian mixture model withcategorical observations looks like this:

$K, N : {\displaystyle K,N:}$ as above
$\phi _{i=1\dots K},{\boldsymbol {\phi }}:$ as above
$z_{i=1\dots N},x_{i=1\dots N}:$ as above
$V : {\displaystyle V:}$ dimension of categorical observations, e.g., size of word vocabulary
$\theta _{i=1\dots K,j=1\dots V}:$ probability for component $i {\displaystyle i}$ of observing item $j {\displaystyle j}$
${\boldsymbol {\theta }}_{i=1\dots K}:$ vector of dimension $V, {\displaystyle V,}$ composed of $\theta _{i,1\dots V};$ must sum to 1
$\alpha :$ shared concentration hyperparameter of ${\boldsymbol {\theta }}$ for each component
$\beta :$ concentration hyperparameter of ${\boldsymbol {\phi }}$

The random variables:

{\begin{array}{lcl}{\boldsymbol {\phi }}&\sim &\operatorname {Symmetric-Dirichlet} _{K}(\beta )\\{\boldsymbol {\theta }}_{i=1\dots K}&\sim &{\text{Symmetric-Dirichlet}}_{V}(\alpha )\\z_{i=1\dots N}&\sim &\operatorname {Categorical} ({\boldsymbol {\phi }})\\x_{i=1\dots N}&\sim &{\text{Categorical}}({\boldsymbol {\theta }}_{z_{i}})\end{array}}

Examples

[edit]

A financial model

[edit]

Thenormal distribution plotted with different means and variances

Financial returns often behave differently in normal situations and during crisis times. A mixture model^[4] for return data seems reasonable. Sometimes the model used is ajump-diffusion model, or as a mixture of two normal distributions. SeeFinancial economics § Challenges and criticism andFinancial risk management § Banking for further context.

House prices

[edit]

Assume that we observe the prices ofN different houses. Different types of houses in different neighborhoods will have vastly different prices, but the price of a particular type of house in a particular neighborhood (e.g., three-bedroom house in moderately upscale neighborhood) will tend to cluster fairly closely around the mean. One possible model of such prices would be to assume that the prices are accurately described by a mixture model withK different components, each distributed as anormal distribution with unknown mean and variance, with each component specifying a particular combination of house type/neighborhood. Fitting this model to observed prices, e.g., using theexpectation-maximization algorithm, would tend to cluster the prices according to house type/neighborhood and reveal the spread of prices in each type/neighborhood. (Note that for values such as prices or incomes that are guaranteed to be positive and which tend to growexponentially, alog-normal distribution might actually be a better model than a normal distribution.)

Topics in a document

[edit]

Assume that a document is composed ofN different words from a total vocabulary of sizeV, where each word corresponds to one ofK possible topics. The distribution of such words could be modelled as a mixture ofK differentV-dimensionalcategorical distributions. A model of this sort is commonly termed atopic model. Note thatexpectation maximization applied to such a model will typically fail to produce realistic results, due (among other things) to theexcessive number of parameters. Some sorts of additional assumptions are typically necessary to get good results. Typically two sorts of additional components are added to the model:

Aprior distribution is placed over the parameters describing the topic distributions, using aDirichlet distribution with aconcentration parameter that is set significantly below 1, so as to encourage sparse distributions (where only a small number of words have significantly non-zero probabilities).
Some sort of additional constraint is placed over the topic identities of words, to take advantage of natural clustering.
- For example, aMarkov chain could be placed on the topic identities (i.e., the latent variables specifying the mixture component of each observation), corresponding to the fact that nearby words belong to similar topics. (This results in ahidden Markov model, specifically one where aprior distribution is placed over state transitions that favors transitions that stay in the same state.)
- Another possibility is thelatent Dirichlet allocation model, which divides up the words intoD different documents and assumes that in each document only a small number of topics occur with any frequency.

Handwriting recognition

[edit]

The following example is based on an example inChristopher M. Bishop,Pattern Recognition and Machine Learning.^[5]

Imagine that we are given anN×N black-and-white image that is known to be a scan of a hand-written digit between 0 and 9, but we don't know which digit is written. We can create a mixture model with $K=10$ different components, where each component is a vector of size $N^{2}$ ofBernoulli distributions (one per pixel). Such a model can be trained with theexpectation-maximization algorithm on an unlabeled set of hand-written digits, and will effectively cluster the images according to the digit being written. The same model could then be used to recognize the digit of another image simply by holding the parameters constant, computing the probability of the new image for each possible digit (a trivial calculation), and returning the digit that generated the highest probability.

Assessing projectile accuracy (a.k.a. circular error probable, CEP)

[edit]

Mixture models apply in the problem of directing multiple projectiles at a target (as in air, land, or sea defense applications), where the physical and/or statistical characteristics of the projectiles differ within the multiple projectiles. An example might be shots from multiple munitions types or shots from multiple locations directed at one target. The combination of projectile types may be characterized as a Gaussian mixture model.^[6] Further, a well-known measure of accuracy for a group of projectiles is thecircular error probable (CEP), which is the numberR such that, on average, half of the group of projectiles falls within the circle of radiusR about the target point. The mixture model can be used to determine (or estimate) the valueR. The mixture model properly captures the different types of projectiles.

Direct and indirect applications

[edit]

The financial example above is one direct application of the mixture model, a situation in which we assume an underlying mechanism so that each observation belongs to one of some number of different sources or categories. This underlying mechanism may or may not, however, be observable. In this form of mixture, each of the sources is described by a component probability density function, and its mixture weight is the probability that an observation comes from this component.

In an indirect application of the mixture model we do not assume such a mechanism. The mixture model is simply used for its mathematical flexibilities. For example, a mixture of twonormal distributions with different means may result in a density with twomodes, which is not modeled by standard parametric distributions. Another example is given by the possibility of mixture distributions to model fatter tails than the basic Gaussian ones, so as to be a candidate for modeling more extreme events.

Predictive Maintenance

[edit]

The mixture model-based clustering is also predominantly used in identifying the state of the machine inpredictive maintenance. Density plots are used to analyze the density of high dimensional features. If multi-model densities are observed, then it is assumed that a finite set of densities are formed by a finite set of normal mixtures. A multivariate Gaussian mixture model is used to cluster the feature data into k number of groups where k represents each state of the machine. The machine state can be a normal state, power off state, or faulty state.^[7] Each formed cluster can be diagnosed using techniques such as spectral analysis. In the recent years, this has also been widely used in other areas such as early fault detection.^[8]

Fuzzy image segmentation

[edit]

An example of Gaussian Mixture in image segmentation with grey histogram

In image processing and computer vision, traditionalimage segmentation models often assign to onepixel only one exclusive pattern. In fuzzy or soft segmentation, any pattern can have certain "ownership" over any single pixel. If the patterns are Gaussian, fuzzy segmentation naturally results in Gaussian mixtures. Combined with other analytic or geometric tools (e.g., phase transitions over diffusive boundaries), such spatially regularized mixture models could lead to more realistic and computationally efficient segmentation methods.^[9]

Point set registration

[edit]

Probabilistic mixture models such asGaussian mixture models (GMM) are used to resolvepoint set registration problems in image processing and computer vision fields. For pair-wisepoint set registration, one point set is regarded as the centroids of mixture models, and the other point set is regarded as data points (observations). State-of-the-art methods are e.g.coherent point drift (CPD)^[10] andStudent's t-distribution mixture models (TMM).^[11] The result of recent research demonstrate the superiority of hybrid mixture models^[12] (e.g. combining Student's t-distribution and Watson distribution/Bingham distribution to model spatial positions and axes orientations separately) compare to CPD and TMM, in terms of inherent robustness, accuracy and discriminative capacity.

Identifiability

[edit]

Identifiability refers to the existence of a unique characterization for any one of the models in the class (family) being considered. Estimation procedures may not be well-defined and asymptotic theory may not hold if a model is not identifiable.

Example

[edit]

LetJ be the class of all binomial distributions withn = 2. Then a mixture of two members ofJ would have

${\begin{aligned}p_{0}&=\pi {\left(1-\theta _{1}\right)}^{2}+\left(1-\pi \right){\left(1-\theta _{2}\right)}^{2}\\[1ex]p_{1}&=2\pi \theta _{1}\left(1-\theta _{1}\right)+2\left(1-\pi \right)\theta _{2}\left(1-\theta _{2}\right)\end{aligned}}$

andp₂ = 1 −p₀ −p₁. Clearly, givenp₀ andp₁, it is not possible to determine the above mixture model uniquely, as there are three parameters(π,θ₁,θ₂) to be determined.

Definition

[edit]

Consider a mixture of parametric distributions of the same class. Let

$J=\{f(\cdot ;\theta ):\theta \in \Omega \}$

be the class of all component distributions. Then theconvex hullK ofJ defines the class of all finite mixture of distributions inJ:

$K=\left\{p(\cdot ):p(\cdot )=\sum _{i=1}^{n}a_{i}f_{i}(\cdot ;\theta _{i}),a_{i}>0,\sum _{i=1}^{n}a_{i}=1,f_{i}(\cdot ;\theta _{i})\in J\ \forall i,n\right\}$

K is said to be identifiable if all its members are unique, that is, given two membersp andp′ inK, being mixtures ofk distributions andk′ distributions respectively inJ, we havep =p′ if and only if, first of all,k =k′ and secondly we can reorder the summations such thata_i =a_i′ andf_i =f_i′ for alli.

Parameter estimation and system identification

[edit]

Parametric mixture models are often used when we know the distributionY and we can sample fromX, but we would like to determine thea_i andθ_i values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.

It is common to think of probability mixture modeling as a missing data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented as their membership in the individual model distributions.

A variety of approaches to the problem of mixture decomposition have been proposed, many of which focus on maximum likelihood methods such asexpectation maximization (EM) or maximuma posteriori estimation (MAP). Generally these methods consider separately the questions of system identification and parameter estimation; methods to determine the number and functional form of components within a mixture are distinguished from methods to estimate the corresponding parameter values. Some notable departures are the graphical methods as outlined in Tarter and Lock^[13] and more recentlyminimum message length (MML) techniques such as Figueiredo and Jain^[14] and to some extent the moment matching pattern analysis routines suggested by McWilliam and Loh (2009).^[15]

Expectation maximization (EM)

[edit]

Expectation maximization (EM) is seemingly the most popular technique used to determine the parameters of a mixture with ana priori given number of components. This is a particular way of implementingmaximum likelihood estimation for this problem. EM is of particular appeal for finite normal mixtures where closed-form expressions are possible such as in the following iterative algorithm by Dempsteret al. (1977)^[16]

w_{s}^{(j+1)}={\frac {1}{N}}\sum _{t=1}^{N}h_{s}^{(j)}(t)

\mu _{s}^{(j+1)}={\frac {\sum _{t=1}^{N}h_{s}^{(j)}(t)x^{(t)}}{\sum _{t=1}^{N}h_{s}^{(j)}(t)}}

\Sigma _{s}^{(j+1)}={\frac {\sum _{t=1}^{N}h_{s}^{(j)}(t)[x^{(t)}-\mu _{s}^{(j+1)}][x^{(t)}-\mu _{s}^{(j+1)}]^{\top }}{\sum _{t=1}^{N}h_{s}^{(j)}(t)}}

with the posterior probabilities

h_{s}^{(j)}(t)={\frac {w_{s}^{(j)}p_{s}(x^{(t)};\mu _{s}^{(j)},\Sigma _{s}^{(j)})}{\sum _{i=1}^{n}w_{i}^{(j)}p_{i}(x^{(t)};\mu _{i}^{(j)},\Sigma _{i}^{(j)})}}.

Thus on the basis of the current estimate for the parameters, theconditional probability for a given observationx^(t) being generated from states is determined for eacht = 1, …,N ;N being the sample size. The parameters are then updated such that the new component weights correspond to the average conditional probability and each component mean and covariance is the component specific weighted average of the mean and covariance of the entire sample.

Dempster^[16] also showed that each successive EM iteration will not decrease the likelihood, a property not shared by other gradient based maximization techniques. Moreover, EM naturally embeds within it constraints on the probability vector, and for sufficiently large sample sizes positive definiteness of the covariance iterates. This is a key advantage since explicitly constrained methods incur extra computational costs to check and maintain appropriate values. Theoretically EM is a first-order algorithm and as such converges slowly to a fixed-point solution. Redner and Walker (1984)^{[full citation needed]} make this point arguing in favour of superlinear and second order Newton and quasi-Newton methods and reporting slow convergence in EM on the basis of their empirical tests. They do concede that convergence in likelihood was rapid even if convergence in the parameter values themselves was not. The relative merits of EM and other algorithms vis-à-vis convergence have been discussed in other literature.^[17]

Other common objections to the use of EM are that it has a propensity to spuriously identify local maxima, as well as displaying sensitivity to initial values.^[18]^[19] One may address these problems by evaluating EM at several initial points in the parameter space but this is computationally costly and other approaches, such as the annealing EM method of Udea and Nakano (1998) (in which the initial components are essentially forced to overlap, providing a less heterogeneous basis for initial guesses), may be preferable.

Figueiredo and Jain^[14] note that convergence to 'meaningless' parameter values obtained at the boundary (where regularity conditions breakdown, e.g., Ghosh and Sen (1985)) is frequently observed when the number of model components exceeds the optimal/true one. On this basis they suggest a unified approach to estimation and identification in which the initialn is chosen to greatly exceed the expected optimal value. Their optimization routine is constructed via a minimum message length (MML) criterion that effectively eliminates a candidate component if there is insufficient information to support it. In this way it is possible to systematize reductions inn and consider estimation and identification jointly.

The expectation step

[edit]

With initial guesses for the parameters of our mixture model, "partial membership" of each data point in each constituent distribution is computed by calculatingexpectation values for the membership variables of each data point. That is, for each data pointx_j and distributionY_i, the membership valuey_i,j is:

y_{i,j}={\frac {a_{i}f_{Y}(x_{j};\theta _{i})}{f_{X}(x_{j})}}.

The maximization step

[edit]

With expectation values in hand for group membership,plug-in estimates are recomputed for the distribution parameters.

The mixing coefficientsa_i are themeans of the membership values over theN data points.

a_{i}={\frac {1}{N}}\sum _{j=1}^{N}y_{i,j}

The component model parametersθ_i are also calculated by expectation maximization using data pointsx_j that have been weighted using the membership values. For example, ifθ is a meanμ

\mu _{i}={\frac {\sum _{j}y_{i,j}x_{j}}{\sum _{j}y_{i,j}}}.

With new estimates fora_i and theθ_i's, the expectation step is repeated to recompute new membership values. The entire procedure is repeated until model parameters converge.

Markov chain Monte Carlo

[edit]

As an alternative to the EM algorithm, the mixture model parameters can be deduced usingposterior sampling as indicated byBayes' theorem. This is still regarded as an incomplete data problem in which membership of data points is the missing data. A two-step iterative procedure known asGibbs sampling can be used.

The previous example of a mixture of twoGaussian distributions can demonstrate how the method works. As before, initial guesses of the parameters for the mixture model are made. Instead of computing partial memberships for each elemental distribution, a membership value for each data point is drawn from aBernoulli distribution (that is, it will be assigned to either the first or the second Gaussian). The Bernoulli parameterθ is determined for each data point on the basis of one of the constituent distributions.^[vague] Draws from the distribution generate membership associations for each data point. Plug-in estimators can then be used as in the M step of EM to generate a new set of mixture model parameters, and the binomial draw step repeated.

Moment matching

[edit]

Themethod of moment matching is one of the oldest techniques for determining the mixture parameters dating back to Karl Pearson's seminal work of 1894.In this approach the parameters of the mixture are determined such that the composite distribution has moments matching some given value. In many instances extraction of solutions to the moment equations may present non-trivial algebraic or computational problems. Moreover, numerical analysis by Day^[20] has indicated that such methods may be inefficient compared to EM. Nonetheless, there has been renewed interest in this method, e.g., Craigmile and Titterington (1998) and Wang.^[21]

McWilliam and Loh (2009) consider the characterisation of a hyper-cuboid normal mixturecopula in large dimensional systems for which EM would be computationally prohibitive. Here a pattern analysis routine is used to generate multivariate tail-dependencies consistent with a set of univariate and (in some sense) bivariate moments. The performance of this method is then evaluated using equity log-return data withKolmogorov–Smirnov test statistics suggesting a good descriptive fit.

Spectral method

[edit]

Some problems in mixture model estimation can be solved usingspectral methods.In particular it becomes useful if data pointsx_i are points in high-dimensionalreal space, and the hidden distributions are known to belog-concave (such asGaussian distribution orExponential distribution).

Spectral methods of learning mixture models are based on the use ofSingular Value Decomposition of a matrix which contains data points.The idea is to consider the topk singular vectors, wherek is the number of distributions to be learned. The projectionof each data point to alinear subspace spanned by those vectors groups points originating from the same distributionvery close together, while points from different distributions stay far apart.

One distinctive feature of the spectral method is that it allows us toprove that ifdistributions satisfy certain separation condition (e.g., not too close), then the estimated mixture will be very close to the true one with high probability.

Graphical Methods

[edit]

Tarter and Lock^[13] describe a graphical approach to mixture identification in which a kernel function is applied to an empirical frequency plot so to reduce intra-component variance. In this way one may more readily identify components having differing means. While thisλ-method does not require prior knowledge of the number or functional form of the components its success does rely on the choice of the kernel parameters which to some extent implicitly embeds assumptions about the component structure.

Other methods

[edit]

Some of them can even probably learn mixtures ofheavy-tailed distributions including those withinfinitevariance (seelinks to papers below).In this setting, EM based methods would not work, since the Expectation step would diverge due to presence ofoutliers.

A simulation

[edit]

To simulate a sample of sizeN that is from a mixture of distributionsF_i,i=1 ton, with probabilitiesp_i (sum= p_i = 1):

GenerateN random numbers from acategorical distribution of sizen and probabilitiesp_i fori= 1= to n. These tell you which of theF_i each of theN values will come from. Denote bym_i the quantity of random numbers assigned to thei^th category.
For eachi, generatem_i random numbers from theF_i distribution.

Extensions

[edit]

In aBayesian setting, additional levels can be added to thegraphical model defining the mixture model. For example, in the commonlatent Dirichlet allocation topic model, the observations are sets of words drawn fromD different documents and theK mixture components represent topics that are shared across documents. Each document has a different set of mixture weights, which specify the topics prevalent in that document. All sets of mixture weights share commonhyperparameters.

A very common extension is to connect thelatent variables defining the mixture component identities into aMarkov chain, instead of assuming that they areindependent identically distributed random variables. The resulting model is termed ahidden Markov model and is one of the most common sequential hierarchical models. Numerous extensions of hidden Markov models have been developed; see the resulting article for more information.

History

[edit]

Mixture distributions and the problem of mixture decomposition, that is the identification of its constituent components and the parameters thereof, has been cited in the literature as far back as 1846 (Quetelet in McLachlan,^[18] 2000) although common reference is made to the work ofKarl Pearson (1894)^[22] as the first author to explicitly address the decomposition problem in characterising non-normal attributes of forehead to body length ratios in female shore crab populations. The motivation for this work was provided by the zoologistWalter Frank Raphael Weldon who had speculated in 1893 (in Tarter and Lock^[13]) that asymmetry in the histogram of these ratios could signal evolutionary divergence. Pearson's approach was to fit a univariate mixture of two normals to the data by choosing the five parameters of the mixture such that the empirical moments matched that of the model.

While his work was successful in identifying two potentially distinct sub-populations and in demonstrating the flexibility of mixtures as a moment matching tool, the formulation required the solution of a 9th degree (nonic) polynomial which at the time posed a significant computational challenge.

Subsequent works focused on addressing these problems, but it was not until the advent of the modern computer and the popularisation ofMaximum Likelihood (MLE) parameterisation techniques that research really took off.^[23] Since that time there has been a vast body of research on the subject spanning areas such asfisheries research,agriculture,botany,economics,medicine,genetics,psychology,palaeontology,electrophoresis,finance,geology andzoology.^[24]

References

[edit]

^Pal, Samyajoy; Heumann, Christian (2024). "Flexible Multivariate Mixture Models: A Comprehensive Approach for Modeling Mixtures of Non-Identical Distributions".International Statistical Review insr.12593.doi:10.1111/insr.12593.
^Chatzis, Sotirios P.; Kosmopoulos, Dimitrios I.; Varvarigou, Theodora A. (2008). "Signal Modeling and Classification Using a Robust Latent Space Model Based on t Distributions".IEEE Transactions on Signal Processing.56 (3):949–963.Bibcode:2008ITSP...56..949C.doi:10.1109/TSP.2007.907912.S2CID 15583243.
^Yu, Guoshen (2012). "Solving Inverse Problems with Piecewise Linear Estimators: From Gaussian Mixture Models to Structured Sparsity".IEEE Transactions on Image Processing.21 (5):2481–2499.arXiv:1006.3056.Bibcode:2012ITIP...21.2481G.doi:10.1109/tip.2011.2176743.PMID 22180506.S2CID 479845.
^Dinov, ID. "Expectation Maximization and Mixture Modeling Tutorial".California Digital Library, Statistics Online Computational Resource, Paper EM_MM,http://repositories.cdlib.org/socr/EM_MM, December 9, 2008
^Bishop, Christopher (2006).Pattern recognition and machine learning. New York: Springer.ISBN 978-0-387-31073-2.
^Spall, J. C. and Maryak, J. L. (1992). "A feasible Bayesian estimator of quantiles for projectile accuracy from non-i.i.d. data."Journal of the American Statistical Association, vol. 87 (419), pp. 676–681.JSTOR 2290205
^Amruthnath, Nagdev; Gupta, Tarun (2018-02-02).Fault Class Prediction in Unsupervised Learning using Model-Based Clustering Approach. Unpublished.doi:10.13140/rg.2.2.22085.14563.
^Amruthnath, Nagdev; Gupta, Tarun (2018-02-01).A Research Study on Unsupervised Machine Learning Algorithms for Fault Detection in Predictive Maintenance. Unpublished.doi:10.13140/rg.2.2.28822.24648.
^Shen, Jianhong (Jackie) (2006)."A stochastic-variational model for soft Mumford-Shah segmentation".International Journal of Biomedical Imaging.2006 092329:2–16.Bibcode:2006IJBI.200649515H.doi:10.1155/IJBI/2006/92329.PMC 2324060.PMID 23165059.
^Myronenko, Andriy; Song, Xubo (2010). "Point set registration: Coherent point drift".IEEE Trans. Pattern Anal. Mach. Intell.32 (12):2262–2275.arXiv:0905.2635.Bibcode:2010ITPAM..32.2262M.doi:10.1109/TPAMI.2010.46.PMID 20975122.S2CID 10809031.
^Ravikumar, Nishant; Gooya, Ali; Cimen, Serkan; Frangi, Alexjandro; Taylor, Zeike (2018)."Group-wise similarity registration of point sets using Student's t-mixture model for statistical shape models".Med. Image Anal.44:156–176.doi:10.1016/j.media.2017.11.012.PMID 29248842.
^Bayer, Siming; Ravikumar, Nishant; Strumia, Maddalena; Tong, Xiaoguang; Gao, Ying; Ostermeier, Martin; Fahrig, Rebecca; Maier, Andreas (2018)."Intraoperative brain shift compensation using a hybrid mixture model".Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. Granada, Spain: Springer, Cham. pp. 116–124.doi:10.1007/978-3-030-00937-3_14.
^^a ^b ^cTarter, Michael E. (1993),Model Free Curve Estimation, Chapman and Hall
^^a ^bFigueiredo, M.A.T.; Jain, A.K. (March 2002). "Unsupervised Learning of Finite Mixture Models".IEEE Transactions on Pattern Analysis and Machine Intelligence.24 (3):381–396.Bibcode:2002ITPAM..24..381F.CiteSeerX 10.1.1.362.9811.doi:10.1109/34.990138.
^McWilliam, N.; Loh, K. (2008),Incorporating Multidimensional Tail-Dependencies in the Valuation of Credit Derivatives (Working Paper)[1]
^^a ^bDempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm".Journal of the Royal Statistical Society, Series B.39 (1):1–38.CiteSeerX 10.1.1.163.7580.doi:10.1111/j.2517-6161.1977.tb01600.x.JSTOR 2984875.
^Xu, L.; Jordan, M.I. (January 1996). "On Convergence Properties of the EM Algorithm for Gaussian Mixtures".Neural Computation.8 (1):129–151.doi:10.1162/neco.1996.8.1.129.hdl:10338.dmlcz/135225.S2CID 207714252.
^^a ^bMcLachlan, G.J. (2000),Finite Mixture Models, Wiley
^Botev, Z.I.;Kroese, D.P. (2004). "Global Likelihood Optimization Via the Cross-Entropy Method, with an Application to Mixture Models".Proceedings of the 2004 Winter Simulation Conference, 2004. Vol. 1. pp. 517–523.CiteSeerX 10.1.1.331.2319.doi:10.1109/WSC.2004.1371358.ISBN 978-0-7803-8786-7.S2CID 6880171.
^Day, N. E. (1969). "Estimating the Components of a Mixture of Normal Distributions".Biometrika.56 (3):463–474.doi:10.2307/2334652.JSTOR 2334652.
^Wang, J. (2001), "Generating daily changes in market variables using a multivariate mixture of normal distributions",Proceedings of the 33rd Winter Conference on Simulation:283–289
^Améndola, Carlos; et al. (2015). "Moment varieties of Gaussian mixtures".Journal of Algebraic Statistics.7.arXiv:1510.04654.Bibcode:2015arXiv151004654A.doi:10.18409/jas.v7i1.42.S2CID 88515304.
^McLachlan, G.J.; Basford, K.E. (1988), "Mixture Models: inference and applications to clustering",Statistics: Textbooks and Monographs,Bibcode:1988mmia.book.....M
^Titterington, Smith & Makov 1985

External links

[edit]

Nielsen, Frank (23 March 2012). "K-MLE: A fast algorithm for learning statistical mixture models".2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 869–872.arXiv:1203.5181.Bibcode:2012arXiv1203.5181N.doi:10.1109/ICASSP.2012.6288022.ISBN 978-1-4673-0046-9.S2CID 935615.
TheSOCR demonstrations of EM and Mixture Modeling
Mixture modelling page (and theSnob program forMinimum Message Length (MML) applied to finite mixture models), maintained by D.L. Dowe.
PyMix – Python Mixture Package, algorithms and data structures for a broad variety of mixture model based data mining applications in Python
sklearn.mixture – A module from thescikit-learn Python library for learning Gaussian Mixture Models (and sampling from them), previously packaged withSciPy and now packaged as aSciKit
GMM.m Matlab code for GMM Implementation
GPUmix C++ implementation of Bayesian Mixture Models using EM and MCMC with 100x speed acceleration using GPGPU.
[2] Matlab code for GMM Implementation using EM algorithm
[3] jMEF: A Java open source library for learning and processing mixtures of exponential families (using duality with Bregman divergences). Includes a Matlab wrapper.
Very Fast and clean C implementation of theExpectation Maximization (EM) algorithm for estimatingGaussian Mixture Models (GMMs).
mclust is an R package for mixture modeling.
dpgmm Pure Python Dirichlet process Gaussian mixture model implementation (variational).
Gaussian Mixture Models Blog post on Gaussian Mixture Models trained via Expectation Maximization, with an implementation in Python.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Mixture_model&oldid=1317416543"

Categories:

Hidden categories:

[8]ページ先頭

Movatterモバイル変換

Structure

General mixture model

Specific examples

Gaussian mixture model

Multivariate Gaussian mixture model

Categorical mixture model

Examples

A financial model

House prices

Topics in a document

Handwriting recognition

Assessing projectile accuracy (a.k.a. circular error probable, CEP)

Direct and indirect applications

Predictive Maintenance

Fuzzy image segmentation

Point set registration

Identifiability

Example

Definition

Parameter estimation and system identification

Expectation maximization (EM)

The expectation step

The maximization step

Markov chain Monte Carlo

Moment matching

Spectral method

Graphical Methods

Other methods

A simulation

Extensions

History

See also

Mixture

Hierarchical models

Outlier detection

References

Further reading

Books on mixture models

Application of Gaussian mixture models

External links