Streaming algorithm

From Wikipedia, the free encyclopedia
Class of algorithms operating on data streams

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.

As a result of these constraints, streaming algorithms often produce approximate answers based on a summary or "sketch" of the data stream.

History


Though streaming algorithms had already been studied by Munro and Paterson[1] as early as 1978, as well as Philippe Flajolet and G. Nigel Martin in 1982/83,[2] the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy.[3] For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.

Semi-streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs,[4] in which the space allowed is linear in the number of vertices n, but only logarithmic in the number of edges m. This relaxation is still meaningful for dense graphs, and can solve interesting problems (such as connectivity) that are insoluble in o(n) space.

Models


Data stream model


In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream".[5] If the stream has length n and the domain has size m, algorithms are generally constrained to use space that is logarithmic in m and n. They can generally make only some small constant number of passes over the stream, sometimes just one.[6]

Turnstile and cash register models


Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector a = (a_1, ..., a_n) (initialized to the zero vector 0) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of a using considerably less space than it would take to represent a precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models.[7]

In the cash register model, each update is of the form ⟨i, c⟩, so that a_i is incremented by some positive integer c. A notable special case is when c = 1 (only unit insertions are permitted).

In the turnstile model, each update is of the form ⟨i, c⟩, so that a_i is incremented by some (possibly negative) integer c. In the "strict turnstile" model, no a_i may be less than zero at any time.
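As a concrete illustration, the three update models differ only in which increments c are legal. A minimal Python sketch, using a dictionary as the frequency vector a (the function and variable names here are illustrative, not from any particular library):

```python
from collections import defaultdict

def apply_update(a, i, c, model="turnstile"):
    """Apply the stream update <i, c> to the frequency vector a.

    model="cash_register":     c must be a positive integer.
    model="strict_turnstile":  c may be negative, but no entry may drop below zero.
    model="turnstile":         c may be any integer.
    """
    if model == "cash_register" and c <= 0:
        raise ValueError("cash register model allows only positive increments")
    a[i] += c
    if model == "strict_turnstile" and a[i] < 0:
        raise ValueError("strict turnstile model forbids negative frequencies")

a = defaultdict(int)                    # the vector a, initialized to 0
for i, c in [(3, 5), (1, 2), (3, -4)]:  # a stream of <i, c> updates
    apply_update(a, i, c)
print(a[3], a[1])  # -> 1 2
```

A streaming algorithm never stores such a dictionary explicitly when the domain is large; the point of the sketches below is to answer queries about a in far less space.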

Sliding window model


Several papers also consider the "sliding window" model.[citation needed] In this model, the function of interest is computed over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.
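For illustration, a window statistic can be maintained exactly by storing the whole window; sliding-window streaming algorithms exist precisely to avoid this linear space. A minimal baseline sketch (the class name is illustrative):

```python
from collections import deque

class SlidingWindowSum:
    """Exact running sum over the last `width` stream items.

    This stores the entire window, so it is only a baseline; sliding-window
    streaming algorithms (e.g. exponential histograms) approximate such
    statistics in logarithmic space.
    """
    def __init__(self, width):
        self.width = width
        self.window = deque()
        self.total = 0

    def push(self, x):
        self.window.append(x)
        self.total += x
        if len(self.window) > self.width:    # oldest item leaves the window
            self.total -= self.window.popleft()
        return self.total

w = SlidingWindowSum(3)
sums = [w.push(x) for x in [1, 2, 3, 4, 5]]
print(sums)  # -> [1, 3, 6, 9, 12]
```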

Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.[citation needed]

Evaluation


The performance of an algorithm that operates on data streams is measured by three basic factors:

  • The number of passes the algorithm must make over the stream.
  • The available memory.
  • The running time of the algorithm.

These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available, but they are not identical. Data stream algorithms only have limited memory available but they may be able to defer action until a group of points arrive, while online algorithms are required to take action as soon as each point arrives.

If the algorithm is an approximation algorithm then the accuracy of the answer is another key factor. The accuracy is often stated as an (ε, δ)-approximation, meaning that the algorithm achieves an error of less than ε with probability 1 − δ.

Applications


Streaming algorithms have several applications in networking such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on.[8] They also have applications in databases, such as estimating the size of a join.[citation needed]

Some streaming problems


Frequency moments


The kth frequency moment of a set of frequencies a is defined as F_k(a) = Σ_{i=1}^{n} a_i^k.

The first moment F_1 is simply the sum of the frequencies (i.e., the total count). The second moment F_2 is useful for computing statistical properties of the data, such as the Gini coefficient of variation. F_∞ is defined as the frequency of the most frequent item(s).
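The definition can be checked against a small stream by computing the moments exactly from the frequency counts — a brute-force illustration, not a streaming algorithm:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i a_i^k of a stream.

    F_0 counts distinct elements (with 0^0 = 1 by convention),
    F_1 is the stream length, and F_inf (not computed here) is the
    largest single frequency.
    """
    counts = Counter(stream)
    return sum(a ** k for a in counts.values())

stream = ["a", "b", "a", "c", "a", "b"]
print(frequency_moment(stream, 0))  # -> 3 (distinct elements)
print(frequency_moment(stream, 1))  # -> 6 (items in total)
print(frequency_moment(stream, 2))  # -> 14 (3^2 + 2^2 + 1^2)
```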

The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.[citation needed]

Calculating frequency moments


A direct approach to finding the frequency moments requires maintaining a register m_i for every distinct element a_i ∈ (1, 2, 3, ..., N), which requires memory of order Ω(N).[3] Since space is limited, we need an algorithm that computes in much lower memory, which can be achieved by using approximations instead of exact values: an algorithm that computes an (ε, δ)-approximation F'_k of F_k,[9] where ε is the approximation parameter and δ is the confidence parameter.[10]

Calculating F_0 (distinct elements in a data stream)
Main article:Count-distinct problem
FM-Sketch algorithm

Flajolet et al.[2] introduced a probabilistic method of counting inspired by a paper by Robert Morris.[11] Morris showed that if the requirement of exact counting is dropped, a counter n can be replaced by a counter log n, which can be stored in log log n bits.[12] Flajolet et al.[2] improved this method by using a hash function h which is assumed to distribute the elements uniformly in the hash space (a binary string of length L):

h : [m] → [0, 2^L − 1]

Let bit(y, k) represent the kth bit in the binary representation of y:

y = Σ_{k≥0} bit(y, k) · 2^k

Let ρ(y) represent the position of the least significant 1-bit in the binary representation of y, with a suitable convention for ρ(0):

ρ(y) = min{ k : bit(y, k) = 1 }   if y > 0
ρ(y) = L                          if y = 0

Let A be a data stream of length M whose cardinality needs to be determined. Let BITMAP[0...L − 1] be the hash space where the values ρ(hash(x)) are recorded. The algorithm below then determines the approximate cardinality of A.

Procedure FM-Sketch:
    for i in 0 to L − 1 do
        BITMAP[i] := 0
    end for
    for x in A do
        index := ρ(hash(x))
        if BITMAP[index] = 0 then
            BITMAP[index] := 1
        end if
    end for
    B := position of leftmost 0 bit of BITMAP[]
    return 2 ^ B

If there are N distinct elements in the data stream, the leftmost 0 bit in BITMAP is expected to appear near position log₂ N, so 2^B estimates the cardinality up to a known constant bias factor (φ ≈ 0.77351).
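The pseudocode above translates directly into Python; the choice of SHA-1 as the hash function and L = 32 are illustrative assumptions:

```python
import hashlib

L = 32  # number of bits in the hash space

def h(x):
    """Hash x to an integer in [0, 2^L - 1]; SHA-1 is an illustrative choice."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def rho(y):
    """Position of the least-significant 1-bit of y, with rho(0) = L."""
    if y == 0:
        return L
    return (y & -y).bit_length() - 1

def fm_sketch(stream):
    bitmap = [0] * L
    for x in stream:
        bitmap[min(rho(h(x)), L - 1)] = 1  # clamp the rho(0) = L corner case
    b = bitmap.index(0)                    # position of the leftmost 0 bit
    return 2 ** b  # dividing by phi ~ 0.77351 gives the bias-corrected estimate

estimate = fm_sketch(range(1000))
print(estimate)  # a power of two; estimate / 0.77351 approximates 1000
```

In practice many independent sketches are averaged (stochastic averaging) to reduce the variance of the estimate.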

K-minimum value algorithm

The previous algorithm describes the first attempt, by Flajolet and Martin, to approximate F_0 in the data stream. Their algorithm picks a random hash function which they assume distributes the hash values uniformly in the hash space.

Bar-Yossef et al.[10] introduced the k-minimum value (KMV) algorithm for determining the number of distinct elements in a data stream. They used a similar hash function h, normalized to [0, 1]: h : [m] → [0, 1]. But they fixed a limit t on the number of hash values kept. The value of t is assumed to be of the order O(1/ε²) (i.e., a smaller approximation error ε requires a larger t). The KMV algorithm keeps only the t smallest hash values seen. After all m values of the stream have arrived, υ = max(h(a_i)) over the kept values is used to calculate F'_0 = t/υ. That is, in a close-to-uniform hash space, they expect at least t elements to have hash values less than O(t/F_0).

Procedure 2 K-Minimum Value:
    initialize the first t values of KMV
    for a in a1 to an do
        if h(a) < Max(KMV) then
            remove Max(KMV) from KMV set
            insert h(a) into KMV
        end if
    end for
    return t / Max(KMV)
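A Python sketch of KMV, keeping the t smallest distinct hash values in a max-heap; the hash function and the parameter values are illustrative assumptions:

```python
import hashlib
import heapq

def h01(x):
    """Hash x to a float in [0, 1) (sketch: SHA-1 truncated to 32 bits)."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2 ** 32

def kmv_estimate(stream, t):
    """Estimate the number of distinct elements from the t smallest hashes."""
    heap = []    # max-heap of the t smallest hash values, stored negated
    seen = set() # the hash values currently kept
    for x in stream:
        v = h01(x)
        if v in seen:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -v)
            seen.add(v)
        elif v < -heap[0]:  # smaller than the current t-th smallest: swap it in
            seen.discard(-heapq.heappushpop(heap, -v))
            seen.add(v)
    if len(heap) < t:       # fewer than t distinct hashes: the count is exact
        return len(heap)
    return t / -heap[0]     # F'_0 = t / (t-th smallest hash value)

print(round(kmv_estimate(range(10000), t=400)))  # roughly 10000
```

Once a hash value has been evicted it can never re-enter the heap (every kept value is smaller), so duplicates in the stream do not bias the estimate.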
Complexity analysis of KMV

The KMV algorithm can be implemented in O((1/ε²) · log(m)) bits of memory: each hash value requires O(log(m)) bits, and there are O(1/ε²) hash values. The access time can be reduced by storing the t hash values in a binary tree, which reduces the time complexity per item to O(log(1/ε) · log(m)).

Calculating F_k

Alon et al.[3] estimate F_k by defining random variables that can be computed within the given space and time; the expected value of these random variables gives the approximate value of F_k.

Assume the length of the sequence, m, is known in advance. Then construct a random variable X as follows:

  • Choose a position p of the stream uniformly at random (1 ≤ p ≤ m) and let a_p be the element there.
  • Let r = |{q : q ≥ p, a_q = a_p}| be the number of occurrences of a_p from position p onward.
  • Define X = m(r^k − (r − 1)^k).

Assume S1 is of the order O(n^{1−1/k}/λ²) and S2 is of the order O(log(1/ε)). The algorithm takes S2 random variables Y_1, Y_2, ..., Y_{S2} and outputs their median Y, where Y_i is the average of X_{ij} for 1 ≤ j ≤ S1.

Now calculate the expected value of the random variable, E(X):

E(X) = Σ_{i=1}^{n} Σ_{j=1}^{m_i} (j^k − (j − 1)^k)
     = (m/m) · [ (1^k + (2^k − 1^k) + … + (m_1^k − (m_1 − 1)^k))
                + (1^k + (2^k − 1^k) + … + (m_2^k − (m_2 − 1)^k)) + …
                + (1^k + (2^k − 1^k) + … + (m_n^k − (m_n − 1)^k)) ]
     = Σ_{i=1}^{n} m_i^k = F_k
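The AMS estimator X = m(r^k − (r − 1)^k) can be implemented in a single pass, even without knowing m in advance, by choosing each random position via reservoir sampling. A Python sketch under that assumption (the parameter values are illustrative):

```python
import random

def ams_fk_estimate(stream, k, s1, s2, seed=0):
    """One-pass AMS estimate of F_k: median of s2 means of s1 estimators.

    Each estimator reservoir-samples a position p uniformly and counts
    r = occurrences of a_p from p onward; X = m * (r^k - (r-1)^k) is an
    unbiased estimate of F_k.
    """
    rng = random.Random(seed)
    n_est = s1 * s2
    sample = [None] * n_est  # a_p for each estimator
    r = [0] * n_est          # occurrences of a_p since its position p
    m = 0
    for x in stream:
        m += 1
        for j in range(n_est):
            if rng.random() < 1.0 / m:  # reservoir step: adopt x as the sample
                sample[j], r[j] = x, 1
            elif x == sample[j]:
                r[j] += 1
    xs = [m * (r[j] ** k - (r[j] - 1) ** k) for j in range(n_est)]
    means = [sum(xs[i * s1:(i + 1) * s1]) / s1 for i in range(s2)]
    return sorted(means)[s2 // 2]   # median of the s2 averages

stream = [1, 2, 1, 3, 1, 2] * 50    # exact F_2 = 150^2 + 100^2 + 50^2 = 35000
est = ams_fk_estimate(stream, k=2, s1=64, s2=5)
print(est)  # close to 35000
```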
Complexity of F_k

From the algorithm to calculate F_k discussed above, we can see that each random variable X stores the values of a_p and r. Computing X therefore requires only log(n) bits for storing a_p and log(n) bits for storing r, and the total number of random variables X is S1 · S2.

Hence the total space complexity of the algorithm is of the order of O((k · log(1/ε)/λ²) · n^{1−1/k} · (log n + log m)).

Simpler approach to calculating F_2

The previous algorithm calculates F_2 in order O(√n · (log m + log n)) memory bits. Alon et al.[3] simplified this algorithm using four-wise independent random variables with values mapped to {−1, 1}.

This further reduces the complexity of calculating F_2 to O((log(1/ε)/λ²) · (log n + log m)).
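A sketch of this "tug-of-war" idea in Python: each counter adds a ±1 sign per item, and the squared counter is an unbiased estimate of F_2. A random cubic polynomial modulo a prime stands in (informally) for the four-wise independent hash; the construction details here are illustrative:

```python
import random

P = 2_147_483_647  # prime modulus for the sign hash (2^31 - 1)

def make_sign_hash(rng):
    """A +/-1 hash from a random cubic polynomial mod P (illustrative)."""
    coeffs = [rng.randrange(1, P) for _ in range(4)]
    def s(x):
        v = 0
        for c in coeffs:
            v = (v * x + c) % P
        return 1 if v % 2 == 1 else -1
    return s

def f2_estimate(stream, s1=16, s2=5, seed=0):
    """Tug-of-war sketch: median of s2 means of s1 squared signed counters."""
    rng = random.Random(seed)
    hashes = [make_sign_hash(rng) for _ in range(s1 * s2)]
    z = [0] * (s1 * s2)
    for x in stream:
        for j, s in enumerate(hashes):
            z[j] += s(x)            # each counter moves by the item's sign
    means = [sum(z[i * s1 + j] ** 2 for j in range(s1)) / s1
             for i in range(s2)]
    return sorted(means)[s2 // 2]

stream = [1, 2, 1, 3, 1, 2] * 50    # exact F_2 = 150^2 + 100^2 + 50^2 = 35000
est = f2_estimate(stream)
print(est)
```

Cross terms a_i · a_j cancel in expectation because the signs are pairwise independent, leaving E[z²] = Σ a_i² = F_2; four-wise independence is what bounds the variance.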

Frequent elements


In the data stream model, the frequent elements problem is to output a set of elements that constitute more than some fixed fraction of the stream. A special case is the majority problem, which is to determine whether or not any value constitutes a majority of the stream.

More formally, fix some positive constant c > 1, let the length of the stream be m, and let f_i denote the frequency of value i in the stream. The frequent elements problem is to output the set { i : f_i > m/c }.[13]

Some notable algorithms are:

  • Boyer–Moore majority vote algorithm
  • Misra–Gries heavy hitters algorithm
  • Lossy counting
  • Count-Min sketch
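One of the classic summaries for this problem, the Misra–Gries algorithm, keeps at most c − 1 counters and guarantees that every element with f_i > m/c survives in the summary; a second pass is needed to discard false positives. A minimal Python sketch:

```python
def misra_gries(stream, c):
    """Misra-Gries summary: candidates for elements with f_i > m/c.

    Keeps at most c - 1 counters. Every element occurring more than m/c
    times is guaranteed to remain, but some survivors may be infrequent,
    so exact counts should be verified in a second pass.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < c - 1:
            counters[x] = 1
        else:  # all counters full: decrement everything, drop zeros
            counters = {k: v - 1 for k, v in counters.items() if v > 1}
    return counters

stream = ["a", "b", "a", "c", "a", "a", "b", "d", "a"]
print(misra_gries(stream, c=2))  # -> {'a': 1}: 'a' is the majority candidate
```

With c = 2 this degenerates to the Boyer–Moore majority vote: a single counter whose surviving key is the only possible majority element.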

Event detection


Detecting events in data streams is often done using a heavy hitters algorithm as listed above: the most frequent items and their frequency are determined using one of these algorithms, then the largest increase over the previous time point is reported as the trend. This approach can be refined by using exponentially weighted moving averages and variance for normalization.[14]

Counting distinct elements


Counting the number of distinct elements in a stream (sometimes called the F_0 moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin. In 2010, Daniel Kane, Jelani Nelson and David Woodruff found an asymptotically optimal algorithm for this problem.[15] It uses O(ε^{−2} + log d) space, with O(1) worst-case update and reporting times, as well as universal hash functions and an r-wise independent hash family where r = Ω(log(1/ε) / log log(1/ε)).

Entropy


The (empirical) entropy of a set of frequencies a is defined as H(a) = −Σ_{i=1}^{n} (a_i/m) log(a_i/m), where m = Σ_{i=1}^{n} a_i.
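For a small example, the empirical entropy can be computed from exact counts; a streaming algorithm would instead approximate H in sublinear space:

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Empirical entropy H(a) = -sum_i (a_i/m) * log2(a_i/m) of a stream.

    Computed from exact frequency counts for illustration only; this is
    not a small-space streaming algorithm.
    """
    counts = Counter(stream)
    m = sum(counts.values())
    return -sum((a / m) * math.log2(a / m) for a in counts.values())

print(empirical_entropy(["a", "b", "a", "b"]))  # -> 1.0 (two equally likely values)
```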

Online learning

Main article:Online machine learning

Learn a model (e.g. a classifier) by a single pass over a training set.
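The classic perceptron fits this setting: each labeled example updates the model once and is then discarded. A minimal sketch (the toy data set is illustrative):

```python
def perceptron_single_pass(stream, dim, lr=1.0):
    """Train a linear classifier with one pass over (features, label) pairs.

    Labels are +1/-1; the model is updated immediately on each example and
    the example is then discarded, as the streaming setting requires.
    """
    w = [0.0] * dim
    b = 0.0
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y * score <= 0:  # mistake: nudge the hyperplane toward the example
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b += lr * y
    return w, b

# Linearly separable toy stream: the label is the sign of the first coordinate.
data = [((2.0, 1.0), 1), ((-1.0, 0.5), -1), ((1.5, -1.0), 1), ((-2.0, -1.0), -1)]
w, b = perceptron_single_pass(data * 5, dim=2)
print(all((sum(wi * xi for wi, xi in zip(w, x)) + b) * y > 0 for x, y in data))  # -> True
```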


Lower bounds


Lower bounds have been computed for many of the data streaming problems that have been studied. By far, the most common technique for computing these lower bounds has been using communication complexity.[citation needed]


Notes

  1. ^ Munro, J. Ian; Paterson, Mike (1978). "Selection and Sorting with Limited Storage". 19th Annual Symposium on Foundations of Computer Science, Ann Arbor, Michigan, USA, 16–18 October 1978. IEEE Computer Society. pp. 253–258. doi:10.1109/SFCS.1978.32.
  2. ^ Flajolet & Martin (1985)
  3. ^ Alon, Matias & Szegedy (1996)
  4. ^ Feigenbaum, Joan; Kannan, Sampath (2005). "On graph problems in a semi-streaming model". Theoretical Computer Science. 348 (2): 207–216. doi:10.1016/j.tcs.2005.09.013.
  5. ^ Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002). "Models and issues in data stream systems". Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS '02. New York, NY, USA: ACM. pp. 1–16. CiteSeerX 10.1.1.138.190. doi:10.1145/543613.543615. ISBN 978-1-58113-507-7. S2CID 2071130.
  6. ^ Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002-09-13). "Counting Distinct Elements in a Data Stream". Randomization and Approximation Techniques in Computer Science. Lecture Notes in Computer Science. Vol. 2483. Springer, Berlin, Heidelberg. pp. 1–10. CiteSeerX 10.1.1.12.6276. doi:10.1007/3-540-45726-7_1. ISBN 978-3-540-45726-8. S2CID 4684185.
  7. ^ Gilbert et al. (2001)
  8. ^ Xu (2007)
  9. ^ Indyk, Piotr; Woodruff, David (2005-01-01). "Optimal approximations of the frequency moments of data streams". Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing. STOC '05. New York, NY, USA: ACM. pp. 202–208. doi:10.1145/1060590.1060621. ISBN 978-1-58113-960-0. S2CID 7911758.
  10. ^ Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002-09-13). Rolim, José D. P.; Vadhan, Salil (eds.). Counting Distinct Elements in a Data Stream. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 1–10. CiteSeerX 10.1.1.12.6276. doi:10.1007/3-540-45726-7_1. ISBN 978-3-540-44147-2. S2CID 4684185.
  11. ^ Morris (1978)
  12. ^ Flajolet, Philippe (1985-03-01). "Approximate counting: A detailed analysis". BIT Numerical Mathematics. 25 (1): 113–134. CiteSeerX 10.1.1.64.5320. doi:10.1007/BF01934993. ISSN 0006-3835. S2CID 2809103.
  13. ^ Cormode, Graham (2014). "Misra–Gries Summaries". In Kao, Ming-Yang (ed.). Encyclopedia of Algorithms. Springer US. pp. 1–5. doi:10.1007/978-3-642-27848-8_572-1. ISBN 978-3-642-27848-8.
  14. ^ Schubert, E.; Weiler, M.; Kriegel, H. P. (2014). SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD '14. pp. 871–880. doi:10.1145/2623330.2623740. ISBN 978-1-4503-2956-9.
  15. ^ Kane, Nelson & Woodruff (2010)

References
