Movatterモバイル変換


[0]ホーム

URL:


Jump to content
WikipediaThe Free Encyclopedia
Search

Flajolet–Martin algorithm

From Wikipedia, the free encyclopedia

TheFlajolet–Martin algorithm is analgorithm for approximating the number of distinct elements in a stream with a single pass and space-consumption logarithmic in the maximal number of possible distinct elements in the stream (thecount-distinct problem). The algorithm was introduced byPhilippe Flajolet andG. Nigel Martin in their 1984 article "Probabilistic Counting Algorithms for Data Base Applications".[1] Later it has been refined in "LogLog counting of large cardinalities" byMarianne Durand andPhilippe Flajolet,[2] and "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" byPhilippe Flajolet et al.[3]

In their 2010 article "An optimal algorithm for the distinct elements problem",[4]Daniel M. Kane,Jelani Nelson andDavid P. Woodruff give an improved algorithm, which uses nearly optimal space and has optimalO(1) update and reporting times.

The algorithm

[edit]

Assume that we are given ahash functionhash(x){\displaystyle \mathrm {hash} (x)} that maps inputx{\displaystyle x} to integers in the range[0;2L1]{\displaystyle [0;2^{L}-1]}, and where the outputs are sufficientlyuniformly distributed. Note that the set of integers from 0 to2L1{\displaystyle 2^{L}-1} corresponds to the set of binary strings of lengthL{\displaystyle L}. For any non-negative integery{\displaystyle y}, definebit(y,k){\displaystyle \mathrm {bit} (y,k)} to be thek{\displaystyle k}-th bit in the binary representation ofy{\displaystyle y}, such that:

y=k0bit(y,k)2k.{\displaystyle y=\sum _{k\geq 0}\mathrm {bit} (y,k)2^{k}.}

We then define a functionρ(y){\displaystyle \rho (y)} that outputs the position of the least-significant set bit in the binary representation ofy{\displaystyle y}, andL{\displaystyle L} if no such set bit can be found as all bits are zero:

ρ(y)={min{k0bit(y,k)0}y>0Ly=0{\displaystyle \rho (y)={\begin{cases}\min\{k\geq 0\mid \mathrm {bit} (y,k)\neq 0\}&y>0\\L&y=0\end{cases}}}

Note that with the above definition we are using 0-indexing for the positions, starting from the least significant bit. For example,ρ(13)=ρ(11012)=0{\displaystyle \rho (13)=\rho (1101_{2})=0}, since the least significant bit is a 1 (0th position), andρ(8)=ρ(10002)=3{\displaystyle \rho (8)=\rho (1000_{2})=3}, since the least significant set bit is at the 3rd position. At this point, note that under the assumption that the output of our hash function is uniformly distributed, then the probability of observing a hash output ending with2k{\displaystyle 2^{k}} (a one, followed byk{\displaystyle k} zeroes) is2(k+1){\displaystyle 2^{-(k+1)}}, since this corresponds to flippingk{\displaystyle k} heads and then a tail with a fair coin.

Now the Flajolet–Martin algorithm for estimating the cardinality of amultisetM{\displaystyle M} is as follows:

  1. Initialize a bit-vector BITMAP to be of lengthL{\displaystyle L} and contain all 0s.
  2. For each elementx{\displaystyle x} inM{\displaystyle M}:
    1. Calculate the indexi=ρ(hash(x)){\displaystyle i=\rho (\mathrm {hash} (x))}.
    2. SetBITMAP[i]=1{\displaystyle \mathrm {BITMAP} [i]=1}.
  3. LetR{\displaystyle R} denote the smallest indexi{\displaystyle i} such thatBITMAP[i]=0{\displaystyle \mathrm {BITMAP} [i]=0}.
  4. Estimate the cardinality ofM{\displaystyle M} as2R/ϕ{\displaystyle 2^{R}/\phi }, whereϕ0.77351{\displaystyle \phi \approx 0.77351}.

The idea is that ifn{\displaystyle n} is the number of distinct elements in the multisetM{\displaystyle M}, thenBITMAP[0]{\displaystyle \mathrm {BITMAP} [0]} is accessed approximatelyn/2{\displaystyle n/2} times,BITMAP[1]{\displaystyle \mathrm {BITMAP} [1]} is accessed approximatelyn/4{\displaystyle n/4} times and so on. Consequently, ifilog2n{\displaystyle i\gg \log _{2}n}, thenBITMAP[i]{\displaystyle \mathrm {BITMAP} [i]} is almost certainly 0, and ifilog2n{\displaystyle i\ll \log _{2}n}, thenBITMAP[i]{\displaystyle \mathrm {BITMAP} [i]} is almost certainly 1. Ifilog2n{\displaystyle i\approx \log _{2}n}, thenBITMAP[i]{\displaystyle \mathrm {BITMAP} [i]} can be expected to be either 1 or 0.

The correction factorϕ0.77351{\displaystyle \phi \approx 0.77351} is found by calculations, which can be found in the original article.

Improving accuracy

[edit]

A problem with the Flajolet–Martin algorithm in the above form is that the results vary significantly. A common solution has been to run the algorithm multiple times withk{\displaystyle k} different hash functions and combine the results from the different runs. One idea is to take the mean of thek{\displaystyle k} results together from each hash function, obtaining a single estimate of the cardinality. The problem with this is that averaging is very susceptible to outliers (which are likely here). A different idea is to use themedian, which is less prone to be influences by outliers. The problem with this is that the results can only take form2R/ϕ{\displaystyle 2^{R}/\phi }, whereR{\displaystyle R} is integer. A common solution is to combine both the mean and the median: Createkl{\displaystyle k\cdot l} hash functions and split them intok{\displaystyle k} distinct groups (each of sizel{\displaystyle l}). Within each group use the mean for aggregating together thel{\displaystyle l} results, and finally take the median of thek{\displaystyle k} group estimates as the final estimate.[5]

The 2007HyperLogLog algorithm splits the multiset into subsets and estimates their cardinalities, then it uses theharmonic mean to combine them into an estimate for the original cardinality.[3]

See also

[edit]

References

[edit]
  1. ^Flajolet, Philippe; Martin, G. Nigel (1985)."Probabilistic counting algorithms for data base applications"(PDF).Journal of Computer and System Sciences.31 (2):182–209.doi:10.1016/0022-0000(85)90041-8. Retrieved2016-12-11.
  2. ^Durand, Marianne; Flajolet, Philippe (2003)."Loglog Counting of Large Cardinalities"(PDF).Algorithms - ESA 2003. Lecture Notes in Computer Science. Vol. 2832. p. 605.doi:10.1007/978-3-540-39658-1_55.ISBN 978-3-540-20064-2. Retrieved2016-12-11.
  3. ^abFlajolet, Philippe; Fusy, Éric; Gandouet, Olivier; Meunier, Frédéric (2007)."Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm"(PDF).Discrete Mathematics and Theoretical Computer Science Proceedings.AH.Nancy, France:127–146.CiteSeerX 10.1.1.76.4286. Retrieved2016-12-11.
  4. ^Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010)."An optimal algorithm for the distinct elements problem"(PDF).Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems of data - PODS '10. p. 41.doi:10.1145/1807085.1807094.ISBN 978-1-4503-0033-9.S2CID 10006932. Retrieved2016-12-11.
  5. ^Leskovec, Rajaraman, Ullman (2014).Mining of Massive Datasets (2nd ed.). Cambridge University Press. p. 144. Retrieved2022-05-30.{{cite book}}: CS1 maint: multiple names: authors list (link)

Additional sources

[edit]
Retrieved from "https://en.wikipedia.org/w/index.php?title=Flajolet–Martin_algorithm&oldid=1277031877"
Category:
Hidden category:

[8]ページ先頭

©2009-2025 Movatter.jp