Streaming algorithm

From Wikipedia, the free encyclopedia
Class of algorithms operating on data streams

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.

As a result of these constraints, streaming algorithms often produce approximate answers based on a summary or "sketch" of the data stream.

History


Though streaming algorithms had already been studied by Munro and Paterson[1] as early as 1978, as well as Philippe Flajolet and G. Nigel Martin in 1982/83,[2] the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy.[3] For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.

Semi-streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs,[4] in which the space allowed is linear in the number of vertices n, but only logarithmic in the number of edges m. This relaxation is still meaningful for dense graphs, and can solve interesting problems (such as connectivity) that are insoluble in o(n) space.

Models


Data stream model


In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream".[5] If the stream has length n and the domain has size m, algorithms are generally constrained to use space that is logarithmic in m and n. They can generally make only some small constant number of passes over the stream, sometimes just one.[6]

Turnstile and cash register models


Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector a = (a_1, ..., a_n) (initialized to the zero vector 0) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of a using considerably less space than it would take to represent a precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models.[7]

In the cash register model, each update is of the form ⟨i, c⟩, so that a_i is incremented by some positive integer c. A notable special case is when c = 1 (only unit insertions are permitted).

In the turnstile model, each update is of the form ⟨i, c⟩, so that a_i is incremented by some (possibly negative) integer c. In the "strict turnstile" model, no a_i may be less than zero at any time.
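As a concrete illustration, the three update models differ only in which increments c are legal. A minimal Python sketch, using a dictionary as the frequency vector a (the function and variable names here are illustrative, not from any particular library):

```python
from collections import defaultdict

def apply_update(a, i, c, model="turnstile"):
    """Apply the stream update <i, c> to the frequency vector a.

    model="cash_register":     c must be a positive integer.
    model="strict_turnstile":  c may be negative, but no entry may drop below zero.
    model="turnstile":         c may be any integer.
    """
    if model == "cash_register" and c <= 0:
        raise ValueError("cash register model allows only positive increments")
    a[i] += c
    if model == "strict_turnstile" and a[i] < 0:
        raise ValueError("strict turnstile model forbids negative frequencies")

a = defaultdict(int)                    # the vector a, initialized to 0
for i, c in [(3, 5), (1, 2), (3, -4)]:  # a stream of <i, c> updates
    apply_update(a, i, c)
print(a[3], a[1])  # -> 1 2
```

A streaming algorithm never stores such a dictionary explicitly when the domain is large; the point of the sketches below is to answer queries about a in far less space.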

Sliding window model


Several papers also consider the "sliding window" model.[citation needed] In this model, the function of interest is computed over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.
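For illustration, a window statistic can be maintained exactly by storing the whole window; sliding-window streaming algorithms exist precisely to avoid this linear space. A minimal baseline sketch (the class name is illustrative):

```python
from collections import deque

class SlidingWindowSum:
    """Exact running sum over the last `width` stream items.

    This stores the entire window, so it is only a baseline; sliding-window
    streaming algorithms (e.g. exponential histograms) approximate such
    statistics in logarithmic space.
    """
    def __init__(self, width):
        self.width = width
        self.window = deque()
        self.total = 0

    def push(self, x):
        self.window.append(x)
        self.total += x
        if len(self.window) > self.width:    # oldest item leaves the window
            self.total -= self.window.popleft()
        return self.total

w = SlidingWindowSum(3)
sums = [w.push(x) for x in [1, 2, 3, 4, 5]]
print(sums)  # -> [1, 3, 6, 9, 12]
```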

Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.[citation needed]

Evaluation


The performance of an algorithm that operates on data streams is measured by three basic factors:

  • The number of passes the algorithm must make over the stream.
  • The available memory.
  • The running time of the algorithm.

These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available, but they are not identical. Data stream algorithms only have limited memory available but they may be able to defer action until a group of points arrive, while online algorithms are required to take action as soon as each point arrives.

If the algorithm is an approximation algorithm then the accuracy of the answer is another key factor. The accuracy is often stated as an (ε, δ)-approximation, meaning that the algorithm achieves an error of less than ε with probability 1 − δ.

Applications


Streaming algorithms have several applications in networking such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on.[8] They also have applications in databases, such as estimating the size of a join.[citation needed]

Some streaming problems


Frequency moments


The kth frequency moment of a set of frequencies a is defined as F_k(a) = Σ_{i=1}^{n} a_i^k.

The first moment F_1 is simply the sum of the frequencies (i.e., the total count). The second moment F_2 is useful for computing statistical properties of the data, such as the Gini coefficient of variation. F_∞ is defined as the frequency of the most frequent item(s).
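The definition can be checked against a small stream by computing the moments exactly from the frequency counts — a brute-force illustration, not a streaming algorithm:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i a_i^k of a stream.

    F_0 counts distinct elements (with 0^0 = 1 by convention),
    F_1 is the stream length, and F_inf (not computed here) is the
    largest single frequency.
    """
    counts = Counter(stream)
    return sum(a ** k for a in counts.values())

stream = ["a", "b", "a", "c", "a", "b"]
print(frequency_moment(stream, 0))  # -> 3 (distinct elements)
print(frequency_moment(stream, 1))  # -> 6 (items in total)
print(frequency_moment(stream, 2))  # -> 14 (3^2 + 2^2 + 1^2)
```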

The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.[citation needed]

Calculating frequency moments


A direct approach to finding the frequency moments requires maintaining a register m_i for every distinct element a_i ∈ (1, 2, 3, ..., N), which requires memory of order Ω(N).[3] Since space is limited, we need an algorithm that computes in much lower memory, which can be achieved by using approximations instead of exact values: an algorithm that computes an (ε, δ)-approximation F'_k of F_k,[9] where ε is the approximation parameter and δ is the confidence parameter.[10]

Calculating F_0 (distinct elements in a data stream)
Main article:Count-distinct problem
FM-Sketch algorithm

Flajolet et al.[2] introduced a probabilistic method of counting inspired by a paper by Robert Morris.[11] Morris showed that if the requirement of exact counting is dropped, a counter n can be replaced by a counter log n, which can be stored in log log n bits.[12] Flajolet et al.[2] improved this method by using a hash function h which is assumed to distribute the elements uniformly in the hash space (a binary string of length L):

h : [m] → [0, 2^L − 1]

Let bit(y, k) represent the kth bit in the binary representation of y:

y = Σ_{k≥0} bit(y, k) · 2^k

Let ρ(y) represent the position of the least significant 1-bit in the binary representation of y, with a suitable convention for ρ(0):

ρ(y) = min{ k : bit(y, k) = 1 }   if y > 0
ρ(y) = L                          if y = 0

Let A be a data stream of length M whose cardinality needs to be determined. Let BITMAP[0...L − 1] be the hash space where the values ρ(hash(x)) are recorded. The algorithm below then determines the approximate cardinality of A.

Procedure FM-Sketch:
    for i in 0 to L − 1 do
        BITMAP[i] := 0
    end for
    for x in A do
        index := ρ(hash(x))
        if BITMAP[index] = 0 then
            BITMAP[index] := 1
        end if
    end for
    B := position of leftmost 0 bit of BITMAP[]
    return 2 ^ B

If there are N distinct elements in the data stream, the leftmost 0 bit in BITMAP is expected to appear near position log₂ N, so 2^B estimates the cardinality up to a known constant bias factor (φ ≈ 0.77351).
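The pseudocode above translates directly into Python; the choice of SHA-1 as the hash function and L = 32 are illustrative assumptions:

```python
import hashlib

L = 32  # number of bits in the hash space

def h(x):
    """Hash x to an integer in [0, 2^L - 1]; SHA-1 is an illustrative choice."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def rho(y):
    """Position of the least-significant 1-bit of y, with rho(0) = L."""
    if y == 0:
        return L
    return (y & -y).bit_length() - 1

def fm_sketch(stream):
    bitmap = [0] * L
    for x in stream:
        bitmap[min(rho(h(x)), L - 1)] = 1  # clamp the rho(0) = L corner case
    b = bitmap.index(0)                    # position of the leftmost 0 bit
    return 2 ** b  # dividing by phi ~ 0.77351 gives the bias-corrected estimate

estimate = fm_sketch(range(1000))
print(estimate)  # a power of two; estimate / 0.77351 approximates 1000
```

In practice many independent sketches are averaged (stochastic averaging) to reduce the variance of the estimate.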

K-minimum value algorithm

The previous algorithm describes the first attempt, by Flajolet and Martin, to approximate F_0 in the data stream. Their algorithm picks a random hash function which they assume distributes the hash values uniformly in the hash space.

Bar-Yossef et al.[10] introduced the k-minimum value (KMV) algorithm for determining the number of distinct elements in a data stream. They used a similar hash function h, normalized to [0, 1]: h : [m] → [0, 1]. But they fixed a limit t on the number of hash values kept. The value of t is assumed to be of the order O(1/ε²) (i.e., a smaller approximation error ε requires a larger t). The KMV algorithm keeps only the t smallest hash values seen. After all m values of the stream have arrived, υ = max(h(a_i)) over the kept values is used to calculate F'_0 = t/υ. That is, in a close-to-uniform hash space, they expect at least t elements to have hash values less than O(t/F_0).

Procedure 2 K-Minimum Value:
    initialize the first t values of KMV
    for a in a1 to an do
        if h(a) < Max(KMV) then
            remove Max(KMV) from KMV set
            insert h(a) into KMV
        end if
    end for
    return t / Max(KMV)
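A Python sketch of KMV, keeping the t smallest distinct hash values in a max-heap; the hash function and the parameter values are illustrative assumptions:

```python
import hashlib
import heapq

def h01(x):
    """Hash x to a float in [0, 1) (sketch: SHA-1 truncated to 32 bits)."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2 ** 32

def kmv_estimate(stream, t):
    """Estimate the number of distinct elements from the t smallest hashes."""
    heap = []    # max-heap of the t smallest hash values, stored negated
    seen = set() # the hash values currently kept
    for x in stream:
        v = h01(x)
        if v in seen:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -v)
            seen.add(v)
        elif v < -heap[0]:  # smaller than the current t-th smallest: swap it in
            seen.discard(-heapq.heappushpop(heap, -v))
            seen.add(v)
    if len(heap) < t:       # fewer than t distinct hashes: the count is exact
        return len(heap)
    return t / -heap[0]     # F'_0 = t / (t-th smallest hash value)

print(round(kmv_estimate(range(10000), t=400)))  # roughly 10000
```

Once a hash value has been evicted it can never re-enter the heap (every kept value is smaller), so duplicates in the stream do not bias the estimate.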
Complexity analysis of KMV

The KMV algorithm can be implemented in O((1/ε²) · log(m)) bits of memory: each hash value requires O(log(m)) bits, and there are O(1/ε²) hash values. The access time can be reduced by storing the t hash values in a binary tree, which reduces the time complexity per item to O(log(1/ε) · log(m)).

Calculating F_k

Alon et al.[3] estimate F_k by defining random variables that can be computed within the given space and time; the expected value of these random variables gives the approximate value of F_k.

Assume the length of the sequence, m, is known in advance. Then construct a random variable X as follows:

  • Choose a position p of the stream uniformly at random (1 ≤ p ≤ m) and let a_p be the element there.
  • Let r = |{q : q ≥ p, a_q = a_p}| be the number of occurrences of a_p from position p onward.
  • Define X = m(r^k − (r − 1)^k).

Assume S1 is of the order O(n^{1−1/k}/λ²) and S2 is of the order O(log(1/ε)). The algorithm takes S2 random variables Y_1, Y_2, ..., Y_{S2} and outputs their median Y, where Y_i is the average of X_{ij} for 1 ≤ j ≤ S1.

Now calculate the expected value of the random variable, E(X):

E(X) = Σ_{i=1}^{n} Σ_{j=1}^{m_i} (j^k − (j − 1)^k)
     = (m/m) · [ (1^k + (2^k − 1^k) + … + (m_1^k − (m_1 − 1)^k))
                + (1^k + (2^k − 1^k) + … + (m_2^k − (m_2 − 1)^k)) + …
                + (1^k + (2^k − 1^k) + … + (m_n^k − (m_n − 1)^k)) ]
     = Σ_{i=1}^{n} m_i^k = F_k
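The AMS estimator X = m(r^k − (r − 1)^k) can be implemented in a single pass, even without knowing m in advance, by choosing each random position via reservoir sampling. A Python sketch under that assumption (the parameter values are illustrative):

```python
import random

def ams_fk_estimate(stream, k, s1, s2, seed=0):
    """One-pass AMS estimate of F_k: median of s2 means of s1 estimators.

    Each estimator reservoir-samples a position p uniformly and counts
    r = occurrences of a_p from p onward; X = m * (r^k - (r-1)^k) is an
    unbiased estimate of F_k.
    """
    rng = random.Random(seed)
    n_est = s1 * s2
    sample = [None] * n_est  # a_p for each estimator
    r = [0] * n_est          # occurrences of a_p since its position p
    m = 0
    for x in stream:
        m += 1
        for j in range(n_est):
            if rng.random() < 1.0 / m:  # reservoir step: adopt x as the sample
                sample[j], r[j] = x, 1
            elif x == sample[j]:
                r[j] += 1
    xs = [m * (r[j] ** k - (r[j] - 1) ** k) for j in range(n_est)]
    means = [sum(xs[i * s1:(i + 1) * s1]) / s1 for i in range(s2)]
    return sorted(means)[s2 // 2]   # median of the s2 averages

stream = [1, 2, 1, 3, 1, 2] * 50    # exact F_2 = 150^2 + 100^2 + 50^2 = 35000
est = ams_fk_estimate(stream, k=2, s1=64, s2=5)
print(est)  # close to 35000
```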
Complexity of F_k

From the algorithm to calculate F_k discussed above, we can see that each random variable X stores the values of a_p and r. Computing X therefore requires only log(n) bits for storing a_p and log(n) bits for storing r, and the total number of random variables X is S1 · S2.

Hence the total space complexity of the algorithm is of the order of O((k · log(1/ε)/λ²) · n^{1−1/k} · (log n + log m)).

Simpler approach to calculating F_2

The previous algorithm calculates F_2 in order O(√n · (log m + log n)) memory bits. Alon et al.[3] simplified this algorithm using four-wise independent random variables with values mapped to {−1, 1}.

This further reduces the complexity of calculating F_2 to O((log(1/ε)/λ²) · (log n + log m)).
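A sketch of this "tug-of-war" idea in Python: each counter adds a ±1 sign per item, and the squared counter is an unbiased estimate of F_2. A random cubic polynomial modulo a prime stands in (informally) for the four-wise independent hash; the construction details here are illustrative:

```python
import random

P = 2_147_483_647  # prime modulus for the sign hash (2^31 - 1)

def make_sign_hash(rng):
    """A +/-1 hash from a random cubic polynomial mod P (illustrative)."""
    coeffs = [rng.randrange(1, P) for _ in range(4)]
    def s(x):
        v = 0
        for c in coeffs:
            v = (v * x + c) % P
        return 1 if v % 2 == 1 else -1
    return s

def f2_estimate(stream, s1=16, s2=5, seed=0):
    """Tug-of-war sketch: median of s2 means of s1 squared signed counters."""
    rng = random.Random(seed)
    hashes = [make_sign_hash(rng) for _ in range(s1 * s2)]
    z = [0] * (s1 * s2)
    for x in stream:
        for j, s in enumerate(hashes):
            z[j] += s(x)            # each counter moves by the item's sign
    means = [sum(z[i * s1 + j] ** 2 for j in range(s1)) / s1
             for i in range(s2)]
    return sorted(means)[s2 // 2]

stream = [1, 2, 1, 3, 1, 2] * 50    # exact F_2 = 150^2 + 100^2 + 50^2 = 35000
est = f2_estimate(stream)
print(est)
```

Cross terms a_i · a_j cancel in expectation because the signs are pairwise independent, leaving E[z²] = Σ a_i² = F_2; four-wise independence is what bounds the variance.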

Frequent elements


In the data stream model, the frequent elements problem is to output a set of elements that constitute more than some fixed fraction of the stream. A special case is the majority problem, which is to determine whether or not any value constitutes a majority of the stream.

More formally, fix some positive constant c > 1, let the length of the stream be m, and let f_i denote the frequency of value i in the stream. The frequent elements problem is to output the set { i : f_i > m/c }.[13]

Some notable algorithms are:

  • Boyer–Moore majority vote algorithm
  • Misra–Gries heavy hitters algorithm
  • Lossy counting
  • Count-Min sketch
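One of the classic summaries for this problem, the Misra–Gries algorithm, keeps at most c − 1 counters and guarantees that every element with f_i > m/c survives in the summary; a second pass is needed to discard false positives. A minimal Python sketch:

```python
def misra_gries(stream, c):
    """Misra-Gries summary: candidates for elements with f_i > m/c.

    Keeps at most c - 1 counters. Every element occurring more than m/c
    times is guaranteed to remain, but some survivors may be infrequent,
    so exact counts should be verified in a second pass.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < c - 1:
            counters[x] = 1
        else:  # all counters full: decrement everything, drop zeros
            counters = {k: v - 1 for k, v in counters.items() if v > 1}
    return counters

stream = ["a", "b", "a", "c", "a", "a", "b", "d", "a"]
print(misra_gries(stream, c=2))  # -> {'a': 1}: 'a' is the majority candidate
```

With c = 2 this degenerates to the Boyer–Moore majority vote: a single counter whose surviving key is the only possible majority element.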

Event detection


Detecting events in data streams is often done using a heavy hitters algorithm as listed above: the most frequent items and their frequency are determined using one of these algorithms, then the largest increase over the previous time point is reported as the trend. This approach can be refined by using exponentially weighted moving averages and variance for normalization.[14]

Counting distinct elements


Counting the number of distinct elements in a stream (sometimes called the F_0 moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin. In 2010, Daniel Kane, Jelani Nelson and David Woodruff found an asymptotically optimal algorithm for this problem.[15] It uses O(ε^{−2} + log d) space, with O(1) worst-case update and reporting times, as well as universal hash functions and an r-wise independent hash family where r = Ω(log(1/ε) / log log(1/ε)).

Entropy


The (empirical) entropy of a set of frequencies a is defined as H(a) = −Σ_{i=1}^{n} (a_i/m) log(a_i/m), where m = Σ_{i=1}^{n} a_i.
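For a small example, the empirical entropy can be computed from exact counts; a streaming algorithm would instead approximate H in sublinear space:

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Empirical entropy H(a) = -sum_i (a_i/m) * log2(a_i/m) of a stream.

    Computed from exact frequency counts for illustration only; this is
    not a small-space streaming algorithm.
    """
    counts = Counter(stream)
    m = sum(counts.values())
    return -sum((a / m) * math.log2(a / m) for a in counts.values())

print(empirical_entropy(["a", "b", "a", "b"]))  # -> 1.0 (two equally likely values)
```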

Online learning

Main article:Online machine learning

Learn a model (e.g. a classifier) by a single pass over a training set.
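The classic perceptron fits this setting: each labeled example updates the model once and is then discarded. A minimal sketch (the toy data set is illustrative):

```python
def perceptron_single_pass(stream, dim, lr=1.0):
    """Train a linear classifier with one pass over (features, label) pairs.

    Labels are +1/-1; the model is updated immediately on each example and
    the example is then discarded, as the streaming setting requires.
    """
    w = [0.0] * dim
    b = 0.0
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y * score <= 0:  # mistake: nudge the hyperplane toward the example
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b += lr * y
    return w, b

# Linearly separable toy stream: the label is the sign of the first coordinate.
data = [((2.0, 1.0), 1), ((-1.0, 0.5), -1), ((1.5, -1.0), 1), ((-2.0, -1.0), -1)]
w, b = perceptron_single_pass(data * 5, dim=2)
print(all((sum(wi * xi for wi, xi in zip(w, x)) + b) * y > 0 for x, y in data))  # -> True
```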


Lower bounds


Lower bounds have been computed for many of the data streaming problems that have been studied. By far, the most common technique for computing these lower bounds has been using communication complexity.[citation needed]


Notes

  1. ^ Munro, J. Ian; Paterson, Mike (1978). "Selection and Sorting with Limited Storage". 19th Annual Symposium on Foundations of Computer Science, Ann Arbor, Michigan, USA, 16–18 October 1978. IEEE Computer Society. pp. 253–258. doi:10.1109/SFCS.1978.32.
  2. ^ Flajolet & Martin (1985)
  3. ^ Alon, Matias & Szegedy (1996)
  4. ^ Feigenbaum, Joan; Kannan, Sampath (2005). "On graph problems in a semi-streaming model". Theoretical Computer Science. 348 (2): 207–216. doi:10.1016/j.tcs.2005.09.013.
  5. ^ Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002). "Models and issues in data stream systems". Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS '02. New York, NY, USA: ACM. pp. 1–16. CiteSeerX 10.1.1.138.190. doi:10.1145/543613.543615. ISBN 978-1-58113-507-7. S2CID 2071130.
  6. ^ Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002-09-13). "Counting Distinct Elements in a Data Stream". Randomization and Approximation Techniques in Computer Science. Lecture Notes in Computer Science. Vol. 2483. Springer, Berlin, Heidelberg. pp. 1–10. CiteSeerX 10.1.1.12.6276. doi:10.1007/3-540-45726-7_1. ISBN 978-3-540-45726-8. S2CID 4684185.
  7. ^ Gilbert et al. (2001)
  8. ^ Xu (2007)
  9. ^ Indyk, Piotr; Woodruff, David (2005-01-01). "Optimal approximations of the frequency moments of data streams". Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing. STOC '05. New York, NY, USA: ACM. pp. 202–208. doi:10.1145/1060590.1060621. ISBN 978-1-58113-960-0. S2CID 7911758.
  10. ^ Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002-09-13). Rolim, José D. P.; Vadhan, Salil (eds.). Counting Distinct Elements in a Data Stream. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 1–10. CiteSeerX 10.1.1.12.6276. doi:10.1007/3-540-45726-7_1. ISBN 978-3-540-44147-2. S2CID 4684185.
  11. ^ Morris (1978)
  12. ^ Flajolet, Philippe (1985-03-01). "Approximate counting: A detailed analysis". BIT Numerical Mathematics. 25 (1): 113–134. CiteSeerX 10.1.1.64.5320. doi:10.1007/BF01934993. ISSN 0006-3835. S2CID 2809103.
  13. ^ Cormode, Graham (2014). "Misra–Gries Summaries". In Kao, Ming-Yang (ed.). Encyclopedia of Algorithms. Springer US. pp. 1–5. doi:10.1007/978-3-642-27848-8_572-1. ISBN 978-3-642-27848-8.
  14. ^ Schubert, E.; Weiler, M.; Kriegel, H. P. (2014). SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD '14. pp. 871–880. doi:10.1145/2623330.2623740. ISBN 978-1-4503-2956-9.
  15. ^ Kane, Nelson & Woodruff (2010)

References
