HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.[1] Calculating the exact cardinality of the distinct elements of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, but can only approximate the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory.[1] HyperLogLog is an extension of the earlier LogLog algorithm,[2] itself deriving from the 1984 Flajolet–Martin algorithm.[3]
In the original paper by Flajolet et al.[1] and in related literature on the count-distinct problem, the term "cardinality" is used to mean the number of distinct elements in a data stream with repeated elements. However, in the theory of multisets the term refers to the sum of multiplicities of each member of a multiset. This article uses Flajolet's definition for consistency with the sources.
The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2^n.[1]
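This observation can be sketched as a minimal single-register estimator. The use of SHA-1 as the hash function and a 32-bit word size are illustrative assumptions, not prescribed by the algorithm:

```python
import hashlib

def leading_zeros(x, bits=32):
    """Number of leading zeros in the bits-bit binary representation of x."""
    return bits - x.bit_length()

def naive_estimate(items, bits=32):
    """Estimate the number of distinct items as 2**n, where n is the maximum
    number of leading zeros seen over the hashed items."""
    max_lz = 0
    for item in items:
        # Hash each item to a uniformly distributed bits-bit integer.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:bits // 8], "big")
        max_lz = max(max_lz, leading_zeros(h, bits))
    return 2 ** max_lz
```

A single estimator of this kind has high variance, which is what motivates splitting the input into many registers as described below.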
In the HyperLogLog algorithm, a hash function is applied to each element in the original multiset to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly distributed set can then be estimated using the algorithm above.
The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set.[4]
The HyperLogLog has three main operations: add to add a new element to the set, count to obtain the cardinality of the set, and merge to obtain the union of two sets. Some derived operations can be computed using the inclusion–exclusion principle, like the cardinality of the intersection or the cardinality of the difference between two HyperLogLogs, combining the merge and count operations.
The data of the HyperLogLog is stored in an array M of m counters (or "registers") that are initialized to 0. The array M initialized from a multiset S is called the HyperLogLog sketch of S.
The add operation consists of computing the hash of the input data v with a hash function h, getting the first b bits (where b is log₂(m)), and adding 1 to them to obtain the address of the register to modify. With the remaining bits compute ρ(w), which returns the position of the leftmost 1, where the leftmost position is 1 (in other words: the number of leading zeros plus 1). The new value of the register will be the maximum between the current value of the register and ρ(w).
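A sketch of the add operation under these definitions. The register array is indexed from 0 here, so the "+1" addressing step disappears; SHA-1 as h and a 32-bit hash are illustrative assumptions:

```python
import hashlib

def hll_add(M, v, b, hash_bits=32):
    """Add value v to the sketch M, a list of m = 2**b registers.
    Mutates M in place."""
    x = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:hash_bits // 8], "big")
    j = x >> (hash_bits - b)               # first b bits: register address
    w = x & ((1 << (hash_bits - b)) - 1)   # remaining hash_bits - b bits
    # rho(w): position of the leftmost 1-bit in w (leading zeros plus one);
    # if w == 0, this evaluates to the word length plus one.
    rho = (hash_bits - b) - w.bit_length() + 1
    M[j] = max(M[j], rho)
```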
The count algorithm consists of computing the harmonic mean of the m registers, and using a constant to derive an estimate E of the count:

E = α_m · m² · Z,  where Z = ( Σ_{j=1}^{m} 2^(−M[j]) )^(−1).
The intuition is that, n being the unknown cardinality of M, each of the m subsets will have n/m elements. Then max_{x ∈ M_j} ρ(x) should be close to log₂(n/m). The harmonic mean of 2 raised to these quantities is m·Z, which should be near n/m. Thus, m²·Z should be approximately n.
Finally, the constant α_m is introduced to correct a systematic multiplicative bias present in m²·Z due to hash collisions.
The constant α_m is not simple to calculate, and can be approximated with the formula[1]

α_m ≈ 0.7213 / (1 + 1.079/m) for m ≥ 128 (the original paper gives α₁₆ = 0.673, α₃₂ = 0.697 and α₆₄ = 0.709 for smaller register counts).
The HyperLogLog technique, though, is biased for small cardinalities below a threshold of (5/2)·m. The original paper proposes using a different algorithm for small cardinalities known as Linear Counting.[5] In the case where the estimate provided above is less than the threshold, E < (5/2)·m, the alternative calculation can be used:

E* = m · ln(m/V),

where V is the number of registers equal to 0.
Additionally, for very large cardinalities approaching the limit of the size of the registers (E > 2³²/30 for 32-bit registers), the cardinality can be estimated with:

E* = −2³² · ln(1 − E/2³²).
With the above corrections for lower and upper bounds, the error can be estimated as σ = 1.04/√m.
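Putting the raw estimate and both range corrections together, a count operation over a sketch M (a list of m registers, as in the add operation described above) might look like this. The α_m values are those stated in the original paper; 32-bit registers are assumed:

```python
import math

def hll_count(M):
    """Estimate the cardinality from a sketch M of m registers, applying the
    small-range (Linear Counting) and large-range (32-bit) corrections."""
    m = len(M)
    # alpha_m: bias-correction constant from the original paper.
    if m == 16:
        alpha = 0.673
    elif m == 32:
        alpha = 0.697
    elif m == 64:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / m)
    Z = 1.0 / sum(2.0 ** -reg for reg in M)   # harmonic-mean indicator
    E = alpha * m * m * Z                     # raw HyperLogLog estimate
    if E <= 2.5 * m:
        V = M.count(0)                        # number of empty registers
        if V > 0:
            E = m * math.log(m / V)           # small range: Linear Counting
    elif E > (1 << 32) / 30.0:
        E = -(1 << 32) * math.log(1.0 - E / (1 << 32))  # large range
    return E
```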
The merge operation for two HLLs (hll₁, hll₂) consists in obtaining the maximum for each pair of registers j = 1, …, m:

hll_union[j] = max(hll₁[j], hll₂[j]).
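Since the merge is simply a register-wise maximum, the union sketch is identical to the sketch that would have been built from the combined stream:

```python
def hll_merge(M1, M2):
    """Union sketch of two HyperLogLog sketches with the same register count:
    the element-wise maximum of the registers."""
    if len(M1) != len(M2):
        raise ValueError("sketches must have the same number of registers")
    return [max(r1, r2) for r1, r2 in zip(M1, M2)]
```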
To analyze the complexity, the data streaming (ε, δ) model[6] is used, which analyzes the space necessary to get a 1 ± ε approximation with a fixed success probability 1 − δ. The relative error of HLL is 1.04/√m and it needs O(ε⁻² log log n + log n) space, where n is the set cardinality and m is the number of registers (usually less than one byte each).
The add operation depends on the size of the output of the hash function. As this size is fixed, we can consider the running time for the add operation to be O(1).
The count and merge operations depend on the number of registers m and have a theoretical cost of O(m). In some implementations (Redis)[7] the number of registers is fixed, and the cost is considered to be O(1) in the documentation.
The HyperLogLog++ algorithm proposes several improvements to the HyperLogLog algorithm to reduce memory requirements and increase accuracy in some ranges of cardinalities:[6]
When the data arrives in a single stream, the Historic Inverse Probability or martingale estimator[8][9] significantly improves the accuracy of the HLL sketch and uses 36% less memory to achieve a given error level. This estimator is provably optimal for any duplicate-insensitive approximate distinct counting sketch on a single stream.
The single stream scenario also leads to variants in the HLL sketch construction. HLL-TailCut+ uses 45% less memory than the original HLL sketch, but at the cost of being dependent on the data insertion order and not being able to merge sketches.[10]