Count sketch is a type of dimensionality reduction that is particularly efficient in statistics, machine learning and algorithms.[1][2] It was invented by Moses Charikar, Kevin Chen and Martin Farach-Colton[3] in an effort to speed up the AMS sketch by Alon, Matias and Szegedy for approximating the frequency moments of streams[4] (these calculations require counting the number of occurrences of each distinct element of the stream).
The sketch is nearly identical[citation needed] to the feature hashing algorithm by John Moody,[5] but differs in its use of hash functions with low dependence, which makes it more practical. To retain a high probability of success, the median trick is used to aggregate multiple count sketches, rather than the mean.
These properties allow its use in explicit kernel methods and bilinear pooling in neural networks, and make it a cornerstone in many numerical linear algebra algorithms.[6]
The inventors of this data structure offer the following iterative explanation of its operation:[3]
at the simplest level, the output of a single hash function $s$ mapping stream elements $q$ into $\{+1, -1\}$ is feeding a single up/down counter $C$. After a single pass over the data, the frequency $n(q)$ of a stream element $q$ can be approximated, although extremely poorly, by the expected value $E(C \cdot s(q))$ (a small simulation of this stage follows the list);
a straightforward way to improve the variance of the previous estimate is to use an array of different hash functions $s_i$, each connected to its own counter $C_i$. For each element $q$, the equality $E(C_i \cdot s_i(q)) = n(q)$ still holds, so averaging across the $i$ range will tighten the approximation;
the previous construct still has a major deficiency: if a lower-frequency-but-still-important output element $a$ exhibits a hash collision with a high-frequency element, the estimate of $n(a)$ can be significantly affected. Avoiding this requires reducing the frequency of collision counter updates between any two distinct elements. This is achieved by replacing each counter $C_i$ in the previous construct with an array of $m$ counters (making the counter set into a two-dimensional matrix $C_{i,j}$), with the index $j$ of the particular counter to be incremented/decremented selected via another set of hash functions $h_i$ that map element $q$ into the range $\{1..m\}$. Since $E(C_{i,h_i(q)} \cdot s_i(q)) = n(q)$, averaging across all values of $i$ will work.
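To make the first stage of this construction concrete, here is a small self-contained simulation (a Python sketch; the stream contents, seed, and number of trials are arbitrary choices for illustration). It draws a fresh sign hash $s$ many times and averages the estimator $C \cdot s(q)$, showing that the estimator is unbiased for $n(q)$ even though any individual draw is very noisy:

```python
import random
from collections import Counter

# Stage one of the construction above: a single sign hash s: q -> {+1, -1}
# feeding one up/down counter C.  The estimator C * s(q) satisfies
# E(C * s(q)) = n(q), but its variance is large.  (Illustrative only; the
# stream and trial count are arbitrary.)
random.seed(0)
stream = [1] * 50 + [2] * 30 + [3] * 20   # n(1)=50, n(2)=30, n(3)=20
trials = 20_000

totals = Counter()
for _ in range(trials):
    s = {q: random.choice((-1, +1)) for q in set(stream)}  # fresh sign hash
    C = sum(s[q] for q in stream)                          # one pass, one counter
    for q in s:
        totals[q] += s[q] * C                              # estimator C * s(q)

for q in sorted(totals):
    print(q, totals[q] / trials)   # averages approach 50, 30, 20
```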
1. For constants $w$ and $t$ (to be defined later) independently choose $t$ random hash functions $h_1, \dots, h_t$ and $s_1, \dots, s_t$ such that $h_i : [n] \to [w]$ and $s_i : [n] \to \{+1, -1\}$. It is necessary that the hash families from which $h_i$ and $s_i$ are chosen be pairwise independent.
2. For each item $q$ in the stream and each $i = 1, \dots, t$, add $s_i(q)$ to the $h_i(q)$th bucket of the $i$th hash.
At the end of this process, one has $wt$ sums $C_{i,j}$, where $C_{i,j} = \sum_{q :\, h_i(q) = j} s_i(q)\, n_q$ and $n_q$ denotes the number of times element $q$ appeared in the stream.
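A minimal sketch of the update phase might look as follows. This is an illustration, not the inventors' code: the class name CountSketch, the Carter-Wegman hash construction $h(x) = ((ax + b) \bmod p) \bmod w$, and the Mersenne-prime modulus are choices made here for the example; any (approximately) pairwise independent hash families would do.

```python
import random

class CountSketch:
    """Count sketch over non-negative integer items (an illustrative
    sketch of the definition above, not a production implementation)."""

    P = (1 << 61) - 1  # Mersenne prime; items must be smaller than this

    def __init__(self, w, t, seed=None):
        rng = random.Random(seed)
        self.w, self.t = w, t
        # Carter-Wegman parameters: (a, b) pairs define the bucket hashes
        # h_i, (c, d) pairs define the sign hashes s_i.  Each family is
        # (approximately) pairwise independent, as step 1 requires.
        self.hp = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(t)]
        self.sp = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(t)]
        self.C = [[0] * w for _ in range(t)]  # the t x w counter matrix

    def h(self, i, x):  # bucket hash h_i : items -> {0, ..., w-1}
        a, b = self.hp[i]
        return ((a * x + b) % self.P) % self.w

    def s(self, i, x):  # sign hash s_i : items -> {+1, -1}
        c, d = self.sp[i]
        return +1 if ((c * x + d) % self.P) % 2 == 0 else -1

    def update(self, q, count=1):
        # Step 2 above: add s_i(q) to the h_i(q)-th bucket of the i-th row.
        for i in range(self.t):
            self.C[i][self.h(i, q)] += self.s(i, q) * count
```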
To estimate the count of an element $q$, one computes the following value: $r_q = \operatorname{median}_{i=1,\dots,t}\; s_i(q) \cdot C_{i, h_i(q)}$.
The values $s_i(q) \cdot C_{i, h_i(q)}$, for $i = 1, \dots, t$, are unbiased estimates of how many times $q$ has appeared in the stream.
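Continuing the hypothetical CountSketch class sketched above, the query is a direct transcription of this median formula:

```python
from statistics import median

def estimate(sketch, q):
    """r_q = median over i of s_i(q) * C[i][h_i(q)], per the formula above.
    `sketch` is an instance of the CountSketch class sketched earlier."""
    return median(sketch.s(i, q) * sketch.C[i][sketch.h(i, q)]
                  for i in range(sketch.t))
```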
The estimate $r_q$ has variance $O(\min(F_1^2 / w^2,\, F_2 / w))$, where $F_1$ is the length of the stream and $F_2 = \sum_q n_q^2$.[7]
Furthermore, $r_q$ is guaranteed never to be more than $2\sqrt{F_2 / w}$ off from the true value, with probability $1 - e^{-O(t)}$.
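As a quick end-to-end check (reusing the hypothetical CountSketch class and estimate function sketched above, with an arbitrary test stream and arbitrarily chosen parameters $w$ and $t$), one can compare the largest observed error against the $2\sqrt{F_2/w}$ guarantee:

```python
import random
from collections import Counter

random.seed(1)
stream = [random.randrange(200) for _ in range(10_000)]  # arbitrary test data
truth = Counter(stream)

cs = CountSketch(w=256, t=5, seed=42)
for q in stream:
    cs.update(q)

F2 = sum(n * n for n in truth.values())   # second frequency moment
bound = 2 * (F2 / cs.w) ** 0.5            # the 2*sqrt(F2/w) guarantee above
worst = max(abs(estimate(cs, q) - truth[q]) for q in truth)
print(f"largest error: {worst}, guaranteed bound: {bound:.1f}")
```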
^ Alon, Noga; Matias, Yossi; Szegedy, Mario. "The space complexity of approximating the frequency moments." Journal of Computer and System Sciences 58.1 (1999): 137–147.
^ Moody, John. "Fast learning in multi-resolution hierarchies." Advances in Neural Information Processing Systems. 1989.
^ Woodruff, David P. "Sketching as a Tool for Numerical Linear Algebra." Foundations and Trends in Theoretical Computer Science 10.1–2 (2014): 1–157.
^ Larsen, Kasper Green; Pagh, Rasmus; Tětek, Jakub. "CountSketches, Feature Hashing and the Median of Three." International Conference on Machine Learning. PMLR, 2021.
^ Pham, Ninh; Pagh, Rasmus (2013). "Fast and scalable polynomial kernels via explicit feature maps." SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery. doi:10.1145/2487575.2487591.