In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes.[1] That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).
Multinomial logistic regression is known by a variety of other names, including polytomous LR,[2][3] multiclass LR, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model.[4]
Multinomial logistic regression is used when the dependent variable in question is nominal (equivalently categorical, meaning that it falls into any one of a set of categories that cannot be ordered in any meaningful way) and for which there are more than two categories. Some examples would be:
These are all statistical classification problems. They all have in common a dependent variable to be predicted that comes from one of a limited set of items that cannot be meaningfully ordered, as well as a set of independent variables (also known as features, explanators, etc.), which are used to predict the dependent variable. Multinomial logistic regression is a particular solution to classification problems that use a linear combination of the observed features and some problem-specific parameters to estimate the probability of each particular value of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data (e.g. some people for whom both the diagnostic test results and blood types are known, or some examples of known words being spoken).
The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case. As with other types of regression, there is no need for the independent variables to be statistically independent from each other (unlike, for example, in a naive Bayes classifier); however, collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case.[5]
If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of K alternatives to be modeled as a set of K − 1 independent binary choices, in which one alternative is chosen as a "pivot" and the other K − 1 compared against it, one at a time. The IIA hypothesis is a core hypothesis in rational choice theory; however, numerous studies in psychology show that individuals often violate this assumption when making choices. An example of a problem case arises if choices include a car and a blue bus. Suppose the odds ratio between the two is 1 : 1. Now if the option of a red bus is introduced, a person may be indifferent between a red and a blue bus, and hence may exhibit a car : blue bus : red bus odds ratio of 1 : 0.5 : 0.5, thus maintaining a 1 : 1 ratio of car : any bus while adopting a changed car : blue bus ratio of 1 : 0.5. Here the red bus option was not in fact irrelevant, because a red bus was a perfect substitute for a blue bus.
If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives. This is especially important to take into account if the analysis aims to predict how choices would change if one alternative were to disappear (for instance, if one political candidate withdraws from a three-candidate race). Other models like the nested logit or the multinomial probit may be used in such cases, as they allow for violation of the IIA.[6]
There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make it difficult to compare different treatments of the subject in different texts. The article on logistic regression presents a number of equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model.
The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that computes a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation using a dot product:

$$\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol\beta_k \cdot \mathbf{X}_i,$$

where $\mathbf{X}_i$ is the vector of explanatory variables describing observation i, $\boldsymbol\beta_k$ is a vector of weights (or regression coefficients) corresponding to outcome k, and $\operatorname{score}(\mathbf{X}_i, k)$ is the score associated with assigning observation i to category k. In discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the utility associated with person i choosing outcome k. The predicted outcome is the one with the highest score.
The difference between the multinomial logit model and numerous other methods, models, algorithms, etc. with the same basic setup (the perceptron algorithm, support vector machines, linear discriminant analysis, etc.) is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. In particular, in the multinomial logit model, the score can directly be converted to a probability value, indicating the probability of observation i choosing outcome k given the measured characteristics of the observation. This provides a principled way of incorporating the prediction of a particular multinomial logit model into a larger procedure that may involve multiple such predictions, each with a possibility of error. Without such means of combining predictions, errors tend to multiply. For example, imagine a large predictive model that is broken down into a series of submodels where the prediction of a given submodel is used as the input of another submodel, and that prediction is in turn used as the input into a third submodel, etc. If each submodel has 90% accuracy in its predictions, and there are five submodels in series, then the overall model has only 0.9^5 ≈ 59% accuracy. If each submodel has 80% accuracy, then overall accuracy drops to 0.8^5 ≈ 33% accuracy. This issue is known as error propagation and is a serious problem in real-world predictive models, which are usually composed of numerous parts. Predicting probabilities of each possible outcome, rather than simply making a single optimal prediction, is one means of alleviating this issue.[citation needed]
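As a rough illustration of the compounding-accuracy arithmetic above (a minimal sketch; the five-stage pipeline and the accuracy figures are simply the hypothetical numbers from the example):

```python
# Hypothetical pipeline of submodels applied in series: a hard (single-best)
# prediction is only right if every stage is right, so accuracies multiply.
for per_stage_accuracy in (0.9, 0.8):
    overall = per_stage_accuracy ** 5   # five submodels in series
    print(f"per-stage {per_stage_accuracy:.0%} -> overall {overall:.0%}")
# per-stage 90% -> overall 59%
# per-stage 80% -> overall 33%
```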
The basic setup is the same as in logistic regression, the only difference being that the dependent variables are categorical rather than binary, i.e. there are K possible outcomes rather than just two. The following description is somewhat shortened; for more details, consult the logistic regression article.
Specifically, it is assumed that we have a series of N observed data points. Each data point i (ranging from 1 to N) consists of a set of M explanatory variables $x_{1,i} \dots x_{M,i}$ (also known as independent variables, predictor variables, features, etc.), and an associated categorical outcome $Y_i$ (also known as dependent variable, response variable), which can take on one of K possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to K. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of N "experiments", although an "experiment" may consist of nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so that the outcome of a new "experiment" can be correctly predicted for a new data point for which the explanatory variables, but not the outcome, are available. In the process, the model attempts to explain the relative effect of differing explanatory variables on the outcome.
Some examples:
As in other forms of linear regression, multinomial logistic regression uses a linear predictor function $f(k,i)$ to predict the probability that observation i has outcome k, of the following form:

$$f(k,i) = \beta_{0,k} + \beta_{1,k} x_{1,i} + \beta_{2,k} x_{2,i} + \cdots + \beta_{M,k} x_{M,i},$$

where $\beta_{m,k}$ is a regression coefficient associated with the mth explanatory variable and the kth outcome. As explained in the logistic regression article, the regression coefficients and explanatory variables are normally grouped into vectors of size M + 1, so that the predictor function can be written more compactly:

$$f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i,$$

where $\boldsymbol\beta_k$ is the set of regression coefficients associated with outcome k, and $\mathbf{x}_i$ (a row vector) is the set of explanatory variables associated with observation i, prepended by a 1 in entry 0.
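For concreteness, a minimal NumPy sketch of this linear predictor; the data values and coefficient matrix below are made up purely for illustration:

```python
import numpy as np

# One observation with M = 3 explanatory variables, prepended by a 1 in entry 0
# so that beta[k, 0] acts as the intercept for outcome k.
x_i = np.array([1.0, 2.5, -0.3, 1.2])          # shape (M + 1,)

# K = 4 outcomes: one row of coefficients per outcome, shape (K, M + 1).
beta = np.array([[ 0.2,  0.4, -1.0, 0.3],
                 [-0.5,  0.1,  0.7, 0.0],
                 [ 1.1, -0.2,  0.3, 0.5],
                 [ 0.0,  0.0,  0.0, 0.0]])      # last outcome kept as a zero "pivot"

scores = beta @ x_i                              # f(k, i) = beta_k . x_i for every k
print(scores)                                    # one raw score per outcome
```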
To arrive at the multinomial logit model, one can imagine, for K possible outcomes, running K − 1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and then the other K − 1 outcomes are separately regressed against the pivot outcome. If outcome K (the last outcome) is chosen as the pivot, the K − 1 regression equations are:

$$\ln \frac{\Pr(Y_i = k)}{\Pr(Y_i = K)} = \boldsymbol\beta_k \cdot \mathbf{X}_i, \qquad k = 1, \dots, K-1.$$
This formulation is also known as the additive log-ratio transform commonly used in compositional data analysis. In other applications it is referred to as "relative risk".[7]
If we exponentiate both sides and solve for the probabilities, we get:

$$\Pr(Y_i = k) = \Pr(Y_i = K)\, e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \qquad k = 1, \dots, K-1.$$
Using the fact that all K of the probabilities must sum to one, we find:

$$\Pr(Y_i = K) = 1 - \sum_{j=1}^{K-1} \Pr(Y_i = K)\, e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} \;\Longrightarrow\; \Pr(Y_i = K) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}.$$
We can use this to find the other probabilities:

$$\Pr(Y_i = k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \qquad k = 1, \dots, K-1.$$
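The derivation above translates directly into a few lines of NumPy. This is only a sketch with made-up coefficient values, using outcome K as the pivot:

```python
import numpy as np

x_i = np.array([1.0, 2.5, -0.3, 1.2])            # explanatory variables, leading 1 for the intercept

# K - 1 = 3 coefficient vectors, one per non-pivot outcome (made-up values).
beta = np.array([[ 0.2,  0.4, -1.0, 0.3],
                 [-0.5,  0.1,  0.7, 0.0],
                 [ 1.1, -0.2,  0.3, 0.5]])

exp_scores = np.exp(beta @ x_i)                  # e^{beta_k . x_i} for k = 1..K-1
denom = 1.0 + exp_scores.sum()                   # the "1" is the pivot outcome's e^0

p_pivot = 1.0 / denom                            # Pr(Y_i = K)
p_other = exp_scores / denom                     # Pr(Y_i = k) for k = 1..K-1

probs = np.append(p_other, p_pivot)
print(probs, probs.sum())                        # the K probabilities sum to 1
```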
The fact that we run multiple regressions reveals why the model relies on the assumption of independence of irrelevant alternatives described above.
The unknown parameters in each vector $\boldsymbol\beta_k$ are typically jointly estimated by maximum a posteriori (MAP) estimation, which is an extension of maximum likelihood using regularization of the weights to prevent pathological solutions (usually a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the weights, but other distributions are also possible). The solution is typically found using an iterative procedure such as generalized iterative scaling,[8] iteratively reweighted least squares (IRLS),[9] by means of gradient-based optimization algorithms such as L-BFGS,[4] or by specialized coordinate descent algorithms.[10]
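As a rough sketch of what such an iterative procedure does, the following uses plain batch gradient descent on the L2-regularized negative log-likelihood (i.e. MAP estimation with a zero-mean Gaussian prior); in practice one would use IRLS, L-BFGS, or a library routine instead, and all data here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 500, 3, 4                              # observations, features, classes

X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M))])    # leading column of 1s
true_beta = rng.normal(size=(K, M + 1))
logits = X @ true_beta.T
probs_true = np.exp(logits - logits.max(axis=1, keepdims=True))
probs_true /= probs_true.sum(axis=1, keepdims=True)
y = np.array([rng.choice(K, p=p) for p in probs_true])       # synthetic categorical outcomes

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)         # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

beta = np.zeros((K, M + 1))
lam, lr = 1e-2, 0.1                              # Gaussian-prior strength and step size
Y_onehot = np.eye(K)[y]

for _ in range(2000):
    P = softmax(X @ beta.T)                      # N x K predicted probabilities
    grad = (P - Y_onehot).T @ X / N + lam * beta # gradient of the regularized loss
    beta -= lr * grad

print(np.mean(softmax(X @ beta.T).argmax(axis=1) == y))      # training accuracy
```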
The formulation of binary logistic regression as a log-linear model can be directly extended to multi-way regression. That is, we model the logarithm of the probability of seeing a given output using the linear predictor as well as an additional normalization factor, the logarithm of the partition function:

$$\ln \Pr(Y_i = k) = \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z, \qquad k = 1, \dots, K.$$
As in the binary case, we need the extra term $-\ln Z$ to ensure that the whole set of probabilities forms a probability distribution, i.e. so that they all sum to one:

$$\sum_{k=1}^{K} \Pr(Y_i = k) = 1.$$
The reason why we need to add a term to ensure normalization, rather than multiply as is usual, is because we have taken the logarithm of the probabilities. Exponentiating both sides turns the additive term into a multiplicative factor, so that the probability is just the Gibbs measure:

$$\Pr(Y_i = k) = \frac{1}{Z}\, e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \qquad k = 1, \dots, K.$$
The quantity Z is called the partition function for the distribution. We can compute the value of the partition function by applying the above constraint that requires all probabilities to sum to 1:

$$1 = \sum_{k=1}^{K} \Pr(Y_i = k) = \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.$$
Therefore

$$Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.$$
Note that this factor is "constant" in the sense that it is not a function of $Y_i$, which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or crucially, with respect to the unknown regression coefficients $\boldsymbol\beta_k$, which we will need to determine through some sort of optimization procedure.
The resulting equations for the probabilities are

$$\Pr(Y_i = k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \qquad k = 1, \dots, K.$$
The following function:

$$\operatorname{softmax}(k, x_1, \dots, x_n) = \frac{e^{x_k}}{\sum_{j=1}^{n} e^{x_j}}$$

is referred to as the softmax function. The reason is that the effect of exponentiating the values $x_1, \dots, x_n$ is to exaggerate the differences between them. As a result, $\operatorname{softmax}(k, x_1, \dots, x_n)$ will return a value close to 0 whenever $x_k$ is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a weighted average that behaves as a smooth function (which can be conveniently differentiated, etc.) and which approximates the indicator function

$$f(k) = \begin{cases} 1 & \text{if } k = \operatorname{arg\,max}(x_1, \dots, x_n), \\ 0 & \text{otherwise}. \end{cases}$$
Thus, we can write the probability equations as

$$\Pr(Y_i = k) = \operatorname{softmax}(k, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots, \boldsymbol\beta_K \cdot \mathbf{X}_i).$$
The softmax function thus serves as the equivalent of the logistic function in binary logistic regression.
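A numerically stable version of this computation is straightforward to write. The following NumPy sketch subtracts the maximum score before exponentiating, which leaves the result unchanged (adding the same constant to every score does not affect the softmax, a fact used again in the identifiability discussion below) but avoids overflow:

```python
import numpy as np

def softmax(scores):
    """Map a vector of raw scores to probabilities that sum to 1."""
    shifted = scores - scores.max()              # same result, avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1, -1.5])         # e.g. beta_k . X_i for each outcome k
probs = softmax(scores)
print(probs, probs.sum())                        # largest score gets the largest probability
```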
Note that not all of the vectors of coefficients are uniquely identifiable. This is due to the fact that all probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only K − 1 separately specifiable probabilities, and hence K − 1 separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector C to all of the coefficient vectors, the equations are identical:

$$\frac{e^{(\boldsymbol\beta_k + C) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + C) \cdot \mathbf{X}_i}} = \frac{e^{C \cdot \mathbf{X}_i}\, e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{e^{C \cdot \mathbf{X}_i} \sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}.$$
As a result, it is conventional to set $C = -\boldsymbol\beta_K$ (or alternatively, one of the other coefficient vectors). Essentially, we set the constant so that one of the vectors becomes $\mathbf{0}$, and all of the other vectors get transformed into the difference between those vectors and the vector we chose. This is equivalent to "pivoting" around one of the K choices, and examining how much better or worse all of the other K − 1 choices are, relative to the choice we are pivoting around. Mathematically, we transform the coefficients as follows:

$$\boldsymbol\beta'_k = \boldsymbol\beta_k - \boldsymbol\beta_K \quad (k = 1, \dots, K-1), \qquad \boldsymbol\beta'_K = \mathbf{0}.$$
This leads to the following equations:

$$\Pr(Y_i = k) = \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}} \quad (k = 1, \dots, K-1), \qquad \Pr(Y_i = K) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}}.$$
Other than the prime symbols on the regression coefficients, this is exactly the same as the form of the model described above, in terms of K − 1 independent two-way regressions.
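The invariance argument above is easy to verify numerically. A small sketch with arbitrary, made-up coefficient vectors, showing that adding a common vector C, or equivalently subtracting $\boldsymbol\beta_K$ from every vector, leaves the probabilities untouched:

```python
import numpy as np

def probs(beta, x):
    scores = beta @ x
    e = np.exp(scores - scores.max())            # stable softmax over the K scores
    return e / e.sum()

x = np.array([1.0, 0.5, -2.0])                   # one observation (leading 1 for the intercept)
beta = np.array([[0.3,  1.2, -0.4],              # K = 3 coefficient vectors (made up)
                 [0.0, -0.7,  0.9],
                 [1.5,  0.2,  0.1]])
C = np.array([5.0, -3.0, 2.0])                   # arbitrary constant vector

print(probs(beta, x))
print(probs(beta + C, x))                        # identical: adding C to every beta_k cancels
print(probs(beta - beta[-1], x))                 # pivoted form: last coefficient vector becomes 0
```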
It is also possible to formulate multinomial logistic regression as a latent variable model, following the two-way latent variable model described for binary logistic regression. This formulation is common in the theory of discrete choice models, and makes it easier to compare multinomial logistic regression to the related multinomial probit model, as well as to extend it to more complex models.
Imagine that, for each data point i and possible outcome k = 1, 2, ..., K, there is a continuous latent variable $Y_{i,k}^{\ast}$ (i.e. an unobserved random variable) that is distributed as follows:

$$Y_{i,k}^{\ast} = \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k, \qquad k = 1, \dots, K,$$

where $\varepsilon_k \sim \operatorname{EV}_1(0,1)$, i.e. a standard type-1 extreme value distribution.
This latent variable can be thought of as the utility associated with data point i choosing outcome k, where there is some randomness in the actual amount of utility obtained, which accounts for other unmodeled factors that go into the choice. The value of the actual variable $Y_i$ is then determined in a non-random fashion from these latent variables (i.e. the randomness has been moved from the observed outcomes into the latent variables), where outcome k is chosen if and only if the associated utility (the value of $Y_{i,k}^{\ast}$) is greater than the utilities of all the other choices, i.e. if the utility associated with outcome k is the maximum of all the utilities. Since the latent variables are continuous, the probability of two having exactly the same value is 0, so we ignore the scenario. That is:

$$\Pr(Y_i = k) = \Pr\!\left(Y_{i,k}^{\ast} > Y_{i,j}^{\ast} \ \text{for all } j \neq k\right), \qquad k = 1, \dots, K.$$
Or equivalently:

$$Y_i = \operatorname{arg\,max}_{k} Y_{i,k}^{\ast}.$$
Let's look more closely at the first equation, which we can write as follows:

$$\begin{aligned}
\Pr(Y_i = k) &= \Pr\!\left(Y_{i,k}^{\ast} > Y_{i,j}^{\ast} \ \forall\, j \neq k\right) \\
&= \Pr\!\left(Y_{i,k}^{\ast} - Y_{i,j}^{\ast} > 0 \ \forall\, j \neq k\right) \\
&= \Pr\!\left(\boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k - \boldsymbol\beta_j \cdot \mathbf{X}_i - \varepsilon_j > 0 \ \forall\, j \neq k\right) \\
&= \Pr\!\left((\boldsymbol\beta_k - \boldsymbol\beta_j) \cdot \mathbf{X}_i > \varepsilon_j - \varepsilon_k \ \forall\, j \neq k\right).
\end{aligned}$$
There are a few things to realize here:
Actually finding the values of the above probabilities is somewhat difficult, and is a problem of computing a particular order statistic (the first, i.e. maximum) of a set of values. However, it can be shown that the resulting expressions are the same as in the above formulations, i.e. the two are equivalent.
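The equivalence can also be checked by simulation. The sketch below (with synthetic, made-up utilities) adds independent standard Gumbel (type-1 extreme value) noise to each deterministic utility $\boldsymbol\beta_k \cdot \mathbf{X}_i$, picks the outcome with the highest total utility, and compares the empirical choice frequencies to the softmax probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = np.array([1.0, 0.2, -0.5])              # deterministic utilities beta_k . X_i (made up)

draws = 200_000
noise = rng.gumbel(loc=0.0, scale=1.0, size=(draws, scores.size))   # type-1 extreme value noise
choices = (scores + noise).argmax(axis=1)        # pick the outcome with the highest latent utility

empirical = np.bincount(choices, minlength=scores.size) / draws
softmax = np.exp(scores) / np.exp(scores).sum()
print(empirical)                                 # close to the softmax probabilities
print(softmax)
```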
When using multinomial logistic regression, one category of the dependent variable is chosen as the reference category. Separate odds ratios are determined for all independent variables for each category of the dependent variable with the exception of the reference category, which is omitted from the analysis. The exponential beta coefficient represents the change in the odds of the dependent variable being in a particular category vis-à-vis the reference category, associated with a one-unit change of the corresponding independent variable.
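In the notation above, with category K as the reference and the pivoted coefficients $\boldsymbol\beta'_k$ (this is a restatement of the formulas already derived, not an additional assumption):

$$\frac{\Pr(Y_i = k)}{\Pr(Y_i = K)} = e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i},$$

so a one-unit increase in the mth explanatory variable multiplies the odds of category k versus the reference category by $e^{\beta'_{m,k}}$.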
The observed values $y_i \in \{1, \dots, K\}$ of the explained variables $Y_i$, for $i = 1, \dots, n$, are considered as realizations of stochastically independent, categorically distributed random variables.
The likelihood function for this model is defined by

$$L = \prod_{i=1}^{n} \prod_{k=1}^{K} \Pr(Y_i = k)^{\delta_{k, y_i}},$$

where the index $i$ denotes the observations 1 to n and the index $k$ denotes the classes 1 to K. $\delta_{k, y_i}$ is the Kronecker delta.
The negative log-likelihood function is therefore the well-known cross-entropy:

$$-\log L = -\sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, y_i} \log \Pr(Y_i = k).$$
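A direct NumPy transcription of this loss (a sketch: the probabilities come from the softmax of the linear predictor, and the small constant inside the logarithm is only there to avoid log(0)):

```python
import numpy as np

def negative_log_likelihood(beta, X, y, K):
    """Cross-entropy of a multinomial logistic model: beta is K x (M+1), X is n x (M+1), y in {0..K-1}."""
    scores = X @ beta.T
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)            # Pr(Y_i = k) for every i, k
    onehot = np.eye(K)[y]                        # plays the role of the Kronecker delta
    return -np.sum(onehot * np.log(P + 1e-12))
```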
In natural language processing, multinomial LR classifiers are commonly used as an alternative to naive Bayes classifiers because they do not assume statistical independence of the random variables (commonly known as features) that serve as predictors. However, learning in such a model is slower than for a naive Bayes classifier, and thus may not be appropriate given a very large number of classes to learn. In particular, learning in a naive Bayes classifier is a simple matter of counting up the number of co-occurrences of features and classes, while in a maximum entropy classifier the weights, which are typically maximized using maximum a posteriori (MAP) estimation, must be learned using an iterative procedure; see the section on estimating the coefficients above.