Linear classifier


In machine learning, a linear classifier makes a classification decision for each object based on a linear combination of its features. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.[1]

Definition

In this case, the solid and empty dots can be correctly classified by any number of linear classifiers. H1 (blue) classifies them correctly, as does H2 (red). H2 could be considered "better" in the sense that it is also furthest from both groups. H3 (green) fails to correctly classify the dots.

If the input feature vector to the classifier is a real vector $\vec{x}$, then the output score is

$$y = f(\vec{w}\cdot\vec{x}) = f\left(\sum_{j} w_{j} x_{j}\right),$$

where $\vec{w}$ is a real vector of weights and $f$ is a function that converts the dot product of the two vectors into the desired output. (In other words, $\vec{w}$ is a one-form or linear functional mapping $\vec{x}$ onto $\mathbb{R}$.) The weight vector $\vec{w}$ is learned from a set of labeled training samples. Often $f$ is a threshold function, which maps all values of $\vec{w}\cdot\vec{x}$ above a certain threshold to the first class and all other values to the second class; e.g.,

$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w}^{\mathsf{T}}\mathbf{x} > \theta, \\ 0 & \text{otherwise} \end{cases}$$

The superscript T indicates the transpose and $\theta$ is a scalar threshold. A more complex $f$ might give the probability that an item belongs to a certain class.

For a two-class classification problem, one can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no".
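
For illustration, the threshold rule above can be written in a few lines of NumPy. The weight vector and threshold below are arbitrary values chosen for the example rather than learned parameters.

```python
import numpy as np

# Sketch of the decision rule above: score an input with a dot product and
# threshold it. The weights and threshold are invented illustrative values,
# not parameters learned from data.
w = np.array([0.4, -1.3, 2.0])   # weight vector, one entry per feature
theta = 0.5                      # scalar threshold

def predict(x):
    """Return 1 if the score w·x exceeds the threshold, else 0."""
    return 1 if np.dot(w, x) > theta else 0

print(predict(np.array([1.0, 0.2, 0.7])))  # 1 (score 1.54 > 0.5)
print(predict(np.array([0.0, 1.0, 0.1])))  # 0 (score -1.1 <= 0.5)
```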

A linear classifier is often used in situations where the speed of classification is an issue, since it is often the fastest classifier, especially when $\vec{x}$ is sparse. Also, linear classifiers often work very well when the number of dimensions in $\vec{x}$ is large, as in document classification, where each element in $\vec{x}$ is typically the number of occurrences of a word in a document (see document-term matrix). In such cases, the classifier should be well-regularized.

Generative models vs. discriminative models


There are two broad classes of methods for determining the parameters of a linear classifier $\vec{w}$: generative and discriminative models.[2][3] Methods of the former model the joint probability distribution, whereas methods of the latter model the conditional density functions $P(\mathrm{class}\mid\vec{x})$. Examples of such generative algorithms include:

  • Linear discriminant analysis (LDA)—assumes Gaussian class-conditional density models.
  • Naive Bayes classifier—assumes conditionally independent features (e.g. multinomial or multivariate Bernoulli event models).

The second set of methods includes discriminative models, which attempt to maximize the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include:

  • Logistic regression—maximum likelihood estimation of $\vec{w}$ assuming that the observed training set was generated by a binomial model that depends on the output of the classifier.
  • Perceptron—an algorithm that attempts to fix all errors encountered in the training set (a minimal sketch of its update rule appears after this list).
  • Fisher's linear discriminant analysis—an algorithm (different from "LDA") that maximizes the ratio of between-class scatter to within-class scatter, without any other assumptions. It is in essence a method of dimensionality reduction for binary classification.[4]
  • Support vector machine—an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set.
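
As a concrete illustration of one of the methods above, the following is a minimal sketch of the classical perceptron update rule, not a production implementation; the toy data, learning rate and stopping criterion are invented for the example, and labels are assumed to be +1/−1.

```python
import numpy as np

# Minimal perceptron sketch: whenever a training example is misclassified,
# move the weight vector toward it. A bias term is folded into the weights
# by appending a constant feature of 1 to every example.
def train_perceptron(X, y, epochs=100, lr=1.0):
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # append constant bias feature
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # example is on the wrong side (or on the boundary)
                w += lr * yi * xi         # nudge the hyperplane toward it
                errors += 1
        if errors == 0:                   # every training example classified correctly
            break
    return w

# Toy, linearly separable data invented for this example.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
print(np.sign(X_aug @ w))  # reproduces the labels: [ 1.  1. -1. -1.]
```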

Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm: principal components analysis (PCA). LDA is a supervised learning algorithm that utilizes the labels of the data, while PCA is an unsupervised learning algorithm that ignores the labels. To summarize, the name is a historical artifact.[5]

Discriminative training often yields higher accuracy than generative modeling of the class-conditional densities[citation needed]. However, handling missing data is often easier with generative density models[citation needed].

All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space $\varphi(\vec{x})$, using the kernel trick.
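
A minimal sketch of this idea, using an explicit degree-2 polynomial feature map rather than an implicit kernel; the map and the weight vector below are hand-picked for illustration, whereas kernel methods avoid ever materializing φ(x):

```python
import numpy as np

# A classifier that is linear in the transformed space phi(x) can be
# non-linear in the original inputs. Here phi maps a 2-D point to its
# degree-2 polynomial features (written out explicitly for illustration).
def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

# Hand-picked weights in the expanded space implementing the circular
# decision boundary x1^2 + x2^2 > 1, which no linear classifier on the
# raw two features could represent.
w_phi = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
theta = 1.0

for point in [np.array([0.2, 0.3]), np.array([1.5, 0.0])]:
    print(point, int(np.dot(w_phi, phi(point)) > theta))
# [0.2 0.3] 0   (inside the circle)
# [1.5 0. ] 1   (outside the circle)
```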

Discriminative training


Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an optimization algorithm that is given a training set with desired outputs and a loss function that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form[1]

$$\underset{\mathbf{w}}{\arg\min}\;R(\mathbf{w}) + C\sum_{i=1}^{N} L(y_{i}, \mathbf{w}^{\mathsf{T}}\mathbf{x}_{i})$$

where

  • $\mathbf{w}$ is a vector of classifier parameters,
  • $L(y_{i}, \mathbf{w}^{\mathsf{T}}\mathbf{x}_{i})$ is a loss function that measures the discrepancy between the classifier's prediction and the true output $y_{i}$ for the $i$'th training example,
  • $R(\mathbf{w})$ is a regularization function that prevents the parameters from getting too large (causing overfitting), and
  • $C$ is a scalar constant (set by the user of the learning algorithm) that controls the balance between the regularization and the loss function.

Popular loss functions include the hinge loss (for linear SVMs) and the log loss (for linear logistic regression). If the regularization function $R$ is convex, then the above is a convex problem.[1] Many algorithms exist for solving such problems; popular ones for linear classification include (stochastic) gradient descent, L-BFGS, coordinate descent and Newton methods.
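
As a sketch of how this objective can be minimized in practice, the following fits an L2-regularized linear logistic regression (log loss, with R(w) = ½‖w‖²) by plain batch gradient descent on invented toy data. The hyperparameters are arbitrary example values, labels are assumed to be +1/−1, no bias term is included, and a real implementation would typically use one of the solvers named above.

```python
import numpy as np

# Gradient descent on  R(w) + C * sum_i L(y_i, w.x_i)  with
# R(w) = 0.5 * ||w||^2 and the log loss L = log(1 + exp(-y_i * w.x_i)).
def train_logreg(X, y, C=1.0, lr=0.1, steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)                  # y_i * w.x_i for every example
        sigma = 1.0 / (1.0 + np.exp(margins))  # -dL/d(margin) for each example
        grad = w - C * (X.T @ (sigma * y))     # gradient of the full objective
        w -= lr * grad
    return w

# Invented toy data: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w = train_logreg(X, y)
print(np.mean(np.sign(X @ w) == y))  # training accuracy of the fitted linear classifier
```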


Notes

  1. Guo-Xun Yuan; Chia-Hua Ho; Chih-Jen Lin (2012). "Recent Advances of Large-Scale Linear Classification" (PDF). Proc. IEEE. 100 (9). Archived (PDF) from the original on 2017-06-10.
  2. T. Mitchell, Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression. Draft version, 2005.
  3. A. Y. Ng and M. I. Jordan. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes. In NIPS 14, 2002.
  4. R. O. Duda, P. E. Hart, D. G. Stork, "Pattern Classification", Wiley, (2001). ISBN 0-471-05669-3.
  5. Duda, Richard O.; Hart, Peter E.; Stork, David G. (2001). Pattern Classification (Second ed.). New York: John Wiley & Sons, Inc. p. 117. ISBN 978-0-471-05669-0.

Further reading

  1. Y. Yang, X. Liu, "A re-examination of text categorization", Proc. ACM SIGIR Conference, pp. 42–49, (1999).
  2. R. Herbrich, "Learning Kernel Classifiers: Theory and Algorithms", MIT Press, (2001). ISBN 0-262-08306-X.