7.7.Kernel Approximation #

This submodule contains functions that approximate the feature mappings thatcorrespond to certain kernels, as they are used for example in support vectormachines (seeSupport Vector Machines).The following feature functions perform non-linear transformations of theinput, which can serve as a basis for linear classification or otheralgorithms.

The advantage of using approximate explicit feature maps compared to thekernel trick,which makes use of feature maps implicitly, is that explicit mappingscan be better suited for online learning and can significantly reduce the costof learning with very large datasets.Standard kernelized SVMs do not scale well to large datasets, but using anapproximate kernel map it is possible to use much more efficient linear SVMs.In particular, the combination of kernel map approximations withSGDClassifier can make non-linear learning on large datasets possible.

Since there has not been much empirical work using approximate embeddings, itis advisable to compare results against exact kernel methods when possible.

7.7.1.Nystroem Method for Kernel Approximation#

The Nystroem method, as implemented inNystroem is a general method forreduced rank approximations of kernels. It achieves this by subsampling withoutreplacement rows/columns of the data on which the kernel is evaluated. While thecomputational complexity of the exact method is\(\mathcal{O}(n^3_{\text{samples}})\), the complexity of the approximationis\(\mathcal{O}(n^2_{\text{components}} \cdot n_{\text{samples}})\), whereone can set\(n_{\text{components}} \ll n_{\text{samples}}\) without asignificant decrease in performance[WS2001].

We can construct the eigendecomposition of the kernel matrix\(K\), basedon the features of the data, and then split it into sampled and unsampled datapoints.

\[\begin{split}K = U \Lambda U^T= \begin{bmatrix} U_1 \\ U_2\end{bmatrix} \Lambda \begin{bmatrix} U_1 \\ U_2 \end{bmatrix}^T= \begin{bmatrix} U_1 \Lambda U_1^T & U_1 \Lambda U_2^T \\ U_2 \Lambda U_1^T & U_2 \Lambda U_2^T \end{bmatrix}\equiv \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}\end{split}\]

where:

\(U\) is orthonormal
\(\Lambda\) is diagonal matrix of eigenvalues
\(U_1\) is orthonormal matrix of samples that were chosen
\(U_2\) is orthonormal matrix of samples that were not chosen

Given that\(U_1 \Lambda U_1^T\) can be obtained by orthonormalization ofthe matrix\(K_{11}\), and\(U_2 \Lambda U_1^T\) can be evaluated (aswell as its transpose), the only remaining term to elucidate is\(U_2 \Lambda U_2^T\). To do this we can express it in terms of the alreadyevaluated matrices:

\[\begin{split}\begin{align} U_2 \Lambda U_2^T &= \left(K_{21} U_1 \Lambda^{-1}\right) \Lambda \left(K_{21} U_1 \Lambda^{-1}\right)^T\\&= K_{21} U_1 (\Lambda^{-1} \Lambda) \Lambda^{-1} U_1^T K_{21}^T\\&= K_{21} U_1 \Lambda^{-1} U_1^T K_{21}^T\\&= K_{21} K_{11}^{-1} K_{21}^T\\&= \left( K_{21} K_{11}^{-\frac12} \right) \left( K_{21} K_{11}^{-\frac12} \right)^T.\end{align}\end{split}\]

Duringfit, the classNystroem evaluates the basis\(U_1\), andcomputes the normalization constant,\(K_{11}^{-\frac12}\). Later, duringtransform, the kernel matrix is determined between the basis (given by thecomponents_ attribute) and the new data points,X. This matrix is thenmultiplied by thenormalization_ matrix for the final result.

By defaultNystroem uses therbf kernel, but it can use any kernelfunction or a precomputed kernel matrix. The number of samples used - which isalso the dimensionality of the features computed - is given by the parametern_components.

Examples

See the example entitledTime-related feature engineering,that shows an efficient machine learning pipeline that uses aNystroem kernel.
SeeExplicit feature map approximation for RBF kernelsfor a comparison ofNystroem kernel withRBFSampler.

7.7.2.Radial Basis Function Kernel#

TheRBFSampler constructs an approximate mapping for the radial basisfunction kernel, also known asRandom Kitchen Sinks[RR2007]. Thistransformation can be used to explicitly model a kernel map, prior to applyinga linear algorithm, for example a linear SVM:

>>>fromsklearn.kernel_approximationimportRBFSampler>>>fromsklearn.linear_modelimportSGDClassifier>>>X=[[0,0],[1,1],[1,0],[0,1]]>>>y=[0,0,1,1]>>>rbf_feature=RBFSampler(gamma=1,random_state=1)>>>X_features=rbf_feature.fit_transform(X)>>>clf=SGDClassifier(max_iter=5)>>>clf.fit(X_features,y)SGDClassifier(max_iter=5)>>>clf.score(X_features,y)1.0

The mapping relies on a Monte Carlo approximation to thekernel values. Thefit function performs the Monte Carlo sampling, whereasthetransform method performs the mapping of the data. Because of theinherent randomness of the process, results may vary between different calls tothefit function.

Thefit function takes two arguments:n_components, which is the target dimensionality of the feature transform,andgamma, the parameter of the RBF-kernel. A highern_components willresult in a better approximation of the kernel and will yield results moresimilar to those produced by a kernel SVM. Note that “fitting” the featurefunction does not actually depend on the data given to thefit function.Only the dimensionality of the data is used.Details on the method can be found in[RR2007].

For a given value ofn_componentsRBFSampler is often less accurateasNystroem.RBFSampler is cheaper to compute, though, makinguse of larger feature spaces more efficient.

../_images/sphx_glr_plot_kernel_approximation_002.png — Comparing an exact RBF kernel (left) with the approximation (right)#

Examples

SeeExplicit feature map approximation for RBF kernels for acomparison ofNystroem kernel withRBFSampler.

7.7.3.Additive Chi Squared Kernel#

The additive chi squared kernel is a kernel on histograms, often used in computer vision.

The additive chi squared kernel as used here is given by

\[k(x, y) = \sum_i \frac{2x_iy_i}{x_i+y_i}\]

This is not exactly the same assklearn.metrics.pairwise.additive_chi2_kernel.The authors of[VZ2010] prefer the version above as it is always positivedefinite.Since the kernel is additive, it is possible to treat all components\(x_i\) separately for embedding. This makes it possible to samplethe Fourier transform in regular intervals, instead of approximatingusing Monte Carlo sampling.

The classAdditiveChi2Sampler implements this component wisedeterministic sampling. Each component is sampled\(n\) times, yielding\(2n+1\) dimensions per input dimension (the multiple of two stemsfrom the real and complex part of the Fourier transform).In the literature,\(n\) is usually chosen to be 1 or 2, transformingthe dataset to sizen_samples*5*n_features (in the case of\(n=2\)).

The approximate feature map provided byAdditiveChi2Sampler can be combinedwith the approximate feature map provided byRBFSampler to yield an approximatefeature map for the exponentiated chi squared kernel.See the[VZ2010] for details and[VVZ2010] for combination with theRBFSampler.

7.7.4.Skewed Chi Squared Kernel#

The skewed chi squared kernel is given by:

\[k(x,y) = \prod_i \frac{2\sqrt{x_i+c}\sqrt{y_i+c}}{x_i + y_i + 2c}\]

It has properties that are similar to the exponentiated chi squared kerneloften used in computer vision, but allows for a simple Monte Carloapproximation of the feature map.

The usage of theSkewedChi2Sampler is the same as the usage describedabove for theRBFSampler. The only difference is in the freeparameter, that is called\(c\).For a motivation for this mapping and the mathematical details see[LS2010].

7.7.5.Polynomial Kernel Approximation via Tensor Sketch#

Thepolynomial kernel is a popular type of kernelfunction given by:

\[k(x, y) = (\gamma x^\top y +c_0)^d\]

where:

x,y are the input vectors
d is the kernel degree

Intuitively, the feature space of the polynomial kernel of degreedconsists of all possible degree-d products among input features, which enableslearning algorithms using this kernel to account for interactions between features.

The TensorSketch[PP2013] method, as implemented inPolynomialCountSketch, is ascalable, input data independent method for polynomial kernel approximation.It is based on the concept of Count sketch[WIKICS][CCF2002] , a dimensionalityreduction technique similar to feature hashing, which instead uses severalindependent hash functions. TensorSketch obtains a Count Sketch of the outer productof two vectors (or a vector with itself), which can be used as an approximation of thepolynomial kernel feature space. In particular, instead of explicitly computingthe outer product, TensorSketch computes the Count Sketch of the vectors and thenuses polynomial multiplication via the Fast Fourier Transform to compute theCount Sketch of their outer product.

Conveniently, the training phase of TensorSketch simply consists of initializingsome random variables. It is thus independent of the input data, i.e. it onlydepends on the number of input features, but not the data values.In addition, this method can transform samples in\(\mathcal{O}(n_{\text{samples}}(n_{\text{features}} + n_{\text{components}} \log(n_{\text{components}})))\)time, where\(n_{\text{components}}\) is the desired output dimension,determined byn_components.

Examples

Scalable learning with polynomial kernel approximation

7.7.6.Mathematical Details#

Kernel methods like support vector machines or kernelizedPCA rely on a property of reproducing kernel Hilbert spaces.For any positive definite kernel function\(k\) (a so called Mercer kernel),it is guaranteed that there exists a mapping\(\phi\)into a Hilbert space\(\mathcal{H}\), such that

\[k(x,y) = \langle \phi(x), \phi(y) \rangle\]

Where\(\langle \cdot, \cdot \rangle\) denotes the inner product in theHilbert space.

If an algorithm, such as a linear support vector machine or PCA,relies only on the scalar product of data points\(x_i\), one may usethe value of\(k(x_i, x_j)\), which corresponds to applying the algorithmto the mapped data points\(\phi(x_i)\).The advantage of using\(k\) is that the mapping\(\phi\) never hasto be calculated explicitly, allowing for arbitrary largefeatures (even infinite).

One drawback of kernel methods is, that it might be necessaryto store many kernel values\(k(x_i, x_j)\) during optimization.If a kernelized classifier is applied to new data\(y_j\),\(k(x_i, y_j)\) needs to be computed to make predictions,possibly for many different\(x_i\) in the training set.

The classes in this submodule allow to approximate the embedding\(\phi\), thereby working explicitly with the representations\(\phi(x_i)\), which obviates the need to apply the kernelor store training examples.

References

[WS2001]

“Using the Nyström method to speed up kernel machines”Williams, C.K.I.; Seeger, M. - 2001.

[RR2007](1,2)

“Random features for large-scale kernel machines”Rahimi, A. and Recht, B. - Advances in neural information processing 2007,

[LS2010]

“Random Fourier approximations for skewed multiplicative histogram kernels”Li, F., Ionescu, C., and Sminchisescu, C.- Pattern Recognition, DAGM 2010, Lecture Notes in Computer Science.

[VZ2010](1,2)

“Efficient additive kernels via explicit feature maps”Vedaldi, A. and Zisserman, A. - Computer Vision and Pattern Recognition 2010

[VVZ2010]

“Generalized RBF feature maps for Efficient Detection”Vempati, S. and Vedaldi, A. and Zisserman, A. and Jawahar, CV - 2010

[PP2013]