Movatterモバイル変換

[0]ホーム

Jump to content

Polynomial kernel

Edit links

From Wikipedia, the free encyclopedia

Machine learning kernel function

This article is about machine learning. For polynomial kernels in complexity theory, seeKernelization.

Illustration of the mapping $\varphi$ . On the left a set of samples in the input space, on the right the same samples in the feature space where the polynomial kernel $K(x,y)$ (for some values of the parameters $c {\displaystyle c}$ and $d {\displaystyle d}$ ) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.

{\displaystyle \varphi } — Illustration of the mapping $\varphi$ . On the left a set of samples in the input space, on the right the same samples in the feature space where the polynomial kernel $K(x,y)$ (for some values of the parameters $c {\displaystyle c}$ and $d {\displaystyle d}$ ) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.

Inmachine learning, thepolynomial kernel is akernel function commonly used withsupport vector machines (SVMs) and otherkernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models.

Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also combinations of these. In the context ofregression analysis, such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that ofpolynomial regression, but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), then the features correspond tological conjunctions of input features.^[1]

Definition

[edit]

For degree-d polynomials, the polynomial kernel is defined as^[2]

K(\mathbf {x} ,\mathbf {y} )=(\mathbf {x} ^{\mathsf {T}}\mathbf {y} +c)^{d}

wherex andy are vectors of sizen in theinput space, i.e. vectors of features computed from training or test samples andc ≥ 0 is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. Whenc = 0, the kernel is called homogeneous.^[3] (A further generalized polykernel dividesx^Ty by a user-specified scalar parametera.^[4])

As a kernel,K corresponds to an inner product in a feature space based on some mappingφ:

K(\mathbf {x} ,\mathbf {y} )=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {y} )\rangle

The nature ofφ can be seen from an example. Letd = 2, so we get the special case of the quadratic kernel. After using themultinomial theorem (twice—the outermost application is thebinomial theorem) and regrouping,

K(\mathbf {x} ,\mathbf {y} )=\left(\sum _{i=1}^{n}x_{i}y_{i}+c\right)^{2}=\sum _{i=1}^{n}\left(x_{i}^{2}\right)\left(y_{i}^{2}\right)+\sum _{i=2}^{n}\sum _{j=1}^{i-1}\left({\sqrt {2}}x_{i}x_{j}\right)\left({\sqrt {2}}y_{i}y_{j}\right)+\sum _{i=1}^{n}\left({\sqrt {2c}}x_{i}\right)\left({\sqrt {2c}}y_{i}\right)+c^{2}

From this it follows that the feature map is given by:

\varphi (x)=\left(x_{n}^{2},\ldots ,x_{1}^{2},{\sqrt {2}}x_{n}x_{n-1},\ldots ,{\sqrt {2}}x_{n}x_{1},{\sqrt {2}}x_{n-1}x_{n-2},\ldots ,{\sqrt {2}}x_{n-1}x_{1},\ldots ,{\sqrt {2}}x_{2}x_{1},{\sqrt {2c}}x_{n},\ldots ,{\sqrt {2c}}x_{1},c\right)

generalizing for $\left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}$ ,where $\mathbf {x} \in \mathbb {R} ^{n}$ , $\mathbf {y} \in \mathbb {R} ^{n}$ and applying themultinomial theorem:

${\begin{alignedat}{2}\left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}&=\sum _{j_{1}+j_{2}+\dots +j_{n+1}=d}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}y_{1}^{j_{1}}\cdots y_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\\&=\varphi (\mathbf {x} )^{T}\varphi (\mathbf {y} )\end{alignedat}}$

The last summation has $l_{d}={\tbinom {n+d}{d}}$ elements, so that:

\varphi (\mathbf {x} )=\left(a_{1},\dots ,a_{l},\dots ,a_{l_{d}}\right)

where $l=(j_{1},j_{2},...,j_{n},j_{n+1})$ and

a_{l}={\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\quad |\quad j_{1}+j_{2}+\dots +j_{n}+j_{n+1}=d

Practical use

[edit]

Although theRBF kernel is more popular in SVM classification than the polynomial kernel, the latter is quite popular innatural language processing (NLP).^[1]^[5]The most common degree isd = 2 (quadratic), since larger degrees tend tooverfit on NLP problems.

Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including:

full expansion of the kernel prior to training/testing with a linear SVM,^[5] i.e. full computation of the mappingφ as in polynomial regression;
basket mining (using a variant of theapriori algorithm) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion;^[6]
inverted indexing of support vectors.^[6]^[1]

One problem with the polynomial kernel is that it may suffer fromnumerical instability: whenx^Ty +c < 1,K(x,y) = (x^Ty +c)^d tends to zero with increasingd, whereas whenx^Ty +c > 1,K(x,y) tends to infinity.^[4]

^[7]

References

[edit]

^^a ^b ^cYoav Goldberg and Michael Elhadad (2008). splitSVM: Fast, Space-Efficient, non-Heuristic, Polynomial Kernel Computation for NLP Applications. Proc. ACL-08: HLT.
^"Archived copy"(PDF). Archived fromthe original(PDF) on 2013-04-15. Retrieved2012-11-12.{{cite web}}: CS1 maint: archived copy as title (link)
^Shashua, Amnon (2009). "Introduction to Machine Learning: Class Notes 67577".arXiv:0904.3664v1 [cs.LG].
^^a ^bLin, Chih-Jen (2012).Machine learning software: design and practical use(PDF). Machine Learning Summer School. Kyoto.
^^a ^bChang, Yin-Wen; Hsieh, Cho-Jui; Chang, Kai-Wei; Ringgaard, Michael; Lin, Chih-Jen (2010)."Training and testing low-degree polynomial data mappings via linear SVM".Journal of Machine Learning Research.11:1471–1490.
^^a ^bKudo, T.; Matsumoto, Y. (2003).Fast methods for kernel-based text analysis. Proc. ACL.
^Lin, Chih-Jen (2012).Shewchuk(PDF). Machine Learning Summer School. Kyoto.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Polynomial_kernel&oldid=1337746775"

Category:

Kernel methods for machine learning

Hidden categories:

[8]ページ先頭