CROSS REFERENCE TO RELATED APPLICATIONS
This application is a division of co-pending U.S. patent application Ser. No. 12/165,524, filed Jun. 30, 2008, which in turn is a continuation of U.S. patent application Ser. No. 11/068,137, filed Feb. 28, 2005 (now U.S. Pat. No. 7,610,397), both of which are herein incorporated by reference in their entireties.
REFERENCE TO GOVERNMENT FUNDING
This invention was made with Government support under Contract No.: H98230-04-3-001 awarded by the U.S. Department of Defense. The Government has certain rights in this invention.
BACKGROUND
The present invention relates generally to data stream processing and relates more particularly to the optimization of data stream operations.
With the proliferation of Internet connections and network-connected sensor devices comes an increasing rate of digital information available from a large number of online sources. These online sources continually generate and provide data (e.g., news items, financial data, sensor readings, Internet transaction records, and the like) to a network in the form of data streams. Data stream processing units are typically implemented in a network to receive or monitor these data streams and process them to produce results in a usable format. For example, a data stream processing unit may be implemented to perform a join operation in which related data items from two or more data streams (e.g., from two or more news sources) are culled and then aggregated or evaluated, for example to produce a list of results or to corroborate each other.
However, the input rates of typical data streams present a challenge. Because data stream processing units have no control over the sometimes sporadic and unpredictable rates at which data streams are input, it is not uncommon for a data stream processing unit to become loaded beyond its capacity, especially during rate spikes. Typical data stream processing units deal with such loading problems by arbitrarily dropping data streams (e.g., declining to receive the data streams). While this does reduce loading, the arbitrary nature of the strategy tends to result in unpredictable and sub-optimal data processing results, because data streams containing useful data may unknowingly be dropped while data streams containing irrelevant data are retained and processed.
Thus, there is a need in the art for a method and apparatus for adaptive load shedding.
SUMMARY OF THE INVENTION
One embodiment of the present method and apparatus for adaptive load shedding includes receiving at least one data stream (comprising a plurality of tuples, or data items) in a first sliding window of memory. A subset of tuples from the received data stream is then selected for processing in accordance with at least one data stream operation, such as a data stream join operation. Tuples that are not selected for processing are ignored. The number of tuples selected and the specific tuples selected depend at least in part on a variety of dynamic parameters, including the rate at which the data stream (and any other processed data streams) is received, time delays associated with the received data stream, a direction of a join operation performed on the data stream and the values of the individual tuples with respect to an expected output.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above recited embodiments of the invention are attained and can be understood in detail, a more particular description of the invention, briefly summarized above, may be obtained by reference to the embodiments thereof which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
FIG. 1 is a schematic diagram illustrating one embodiment of a data stream processing unit adapted for use with the present invention;
FIG. 2 is a flow diagram illustrating one embodiment of a method for adaptive load shedding for data stream processing according to the present invention;
FIG. 3 is a flow diagram illustrating one embodiment of a method for determining the quantity of data to be processed, in accordance with the method illustrated in FIG. 2;
FIG. 4 is a schematic diagram illustrating the basis for one embodiment of an adaptive tuple selection method based on time correlation;
FIG. 5 is a flow diagram illustrating one embodiment of a method for prioritizing sub-windows of a given sliding window for use in tuple selection, in accordance with the method illustrated in FIG. 2;
FIG. 6 is a flow diagram illustrating one embodiment of a method for selecting tuples for processing, in accordance with the method illustrated in FIG. 2;
FIG. 7 is a flow diagram illustrating another embodiment of a method for selecting tuples for processing, in accordance with the method illustrated in FIG. 2; and
FIG. 8 is a flow diagram illustrating yet another embodiment of a method for selecting tuples for processing, in accordance with the method illustrated in FIG. 2.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
DETAILED DESCRIPTION
In one embodiment, the present invention is a method and apparatus for adaptive load shedding, e.g., for data stream operations. Embodiments of the present invention make it possible for load shedding to be performed in an “intelligent” (e.g., non-arbitrary) manner, thereby maximizing the quality of the data stream operation output (e.g., in terms of a total number of output items generated or in terms of the value of the output generated).
Within the context of the present invention, the term “tuple” may be understood to be a discrete data item within a stream of data (e.g., where the stream of data may comprise multiple tuples).
FIG. 1 is a schematic diagram illustrating one embodiment of a data stream processing unit 100 adapted for use with the present invention. The data stream processing unit 100 illustrated in FIG. 1 is configured as a general purpose computing device and is further configured for performing data stream joins. Although the present invention will be described within the exemplary context of data stream joins, those skilled in the art will appreciate that the teachings of the invention described herein may be applied to optimize a variety of data stream operations, including filtering, transforming and the like.
As illustrated, the data stream processing unit 100 is configured to receive two or more input data streams 102_1-102_n (hereinafter collectively referred to as “input data streams 102”), e.g., from two or more different data sources (not shown), and to process these input data streams 102 to produce a single output data stream 104. The data stream processing unit 100 thus comprises a processor 106 (e.g., a central processing unit or CPU), a memory 108 (such as a random access memory, or RAM) and a storage device 110 (such as a disk drive, an optical disk drive, a floppy disk drive and the like). Those skilled in the art will appreciate that some data stream processing units may be configured to receive only a single input data stream and still be adaptable for use with the present invention.
As each input data stream 102 is received by the data stream processing unit 100, tuples (e.g., discrete data items) from the input data streams 102 are stored in a respective sliding window 112_1-112_n (hereinafter collectively referred to as “sliding windows 112”) in the memory 108. These sliding windows can be user-configurable or system-defined (e.g., based on available memory space) and may be count-based (e.g., configured to store “the last x tuples” of the input data streams) or time-based (e.g., configured to store “the last x seconds” of the input data streams). Thus, as a new tuple from an input data stream 102 arrives in a respective sliding window 112, the new tuple may force an existing tuple to leave the sliding window 112 (if the sliding window 112 was full before receipt of the new tuple). The memory 108 also stores program logic for the adaptive load shedding method of the present invention, as well as logic for other miscellaneous applications. Alternatively, portions of the input data stream and program logic can be stored on the storage device 110.
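The two window flavors described above can be sketched as follows. This is a minimal illustration only; the class names `CountWindow` and `TimeWindow` and their methods are hypothetical and do not appear in the specification.

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps "the last x tuples" of a stream."""

    def __init__(self, capacity):
        # A deque with maxlen evicts the oldest element when a new one
        # is appended to a full window.
        self.items = deque(maxlen=capacity)

    def insert(self, tup):
        self.items.append(tup)


class TimeWindow:
    """Time-based window: keeps "the last x seconds" of a stream."""

    def __init__(self, span_seconds):
        self.span = span_seconds
        self.items = deque()  # (timestamp, tuple) pairs in arrival order

    def insert(self, timestamp, tup):
        self.items.append((timestamp, tup))
        # Evict tuples that are older than the window span.
        while self.items and self.items[0][0] < timestamp - self.span:
            self.items.popleft()
```

In both cases, the arrival of a new tuple is what forces older tuples out of the window, as described above.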
To perform a join operation, the processor 106 executes the program logic stored in the memory 108 to process tuples from the input data streams 102 that are stored in the sliding windows 112. Specifically, the join operation is performed by comparing a tuple (e.g., tuple x) from a first sliding window 112_1 with at least one tuple from at least a second sliding window 112_n. If one or more tuples from the second sliding window 112_n (e.g., tuples y, v, and u) match the join condition for the tuple x, then the matching tuples will be joined such that the output data stream 104 will comprise one or more matched sets of tuples, e.g., (x, y), (x, v) and (x, u).
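As a rough illustration of the comparison described above, the following sketch joins one tuple against the contents of another window. The function name `join_tuple` and the `matches` predicate are hypothetical; the join condition itself is application-specific and is not defined here.

```python
def join_tuple(x, other_window, matches):
    """Return the matched pairs (x, y) for every y in other_window
    that satisfies the join condition matches(x, y)."""
    return [(x, y) for y in other_window if matches(x, y)]


# Example join condition (an assumption): tuples match when their "key" fields agree.
def same_key(p, q):
    return p["key"] == q["key"]


x = {"key": "k1", "src": "A"}
window = [{"key": "k1", "src": "B"}, {"key": "k2", "src": "B"}, {"key": "k1", "src": "C"}]
pairs = join_tuple(x, window, same_key)
```

Here `pairs` holds two matched sets, one for each tuple in `window` whose key equals that of `x`.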
Thus, the adaptive load shedding method of the present invention may be represented by one or more software applications (or even a combination of software and hardware, e.g., using Application Specific Integrated Circuits (ASIC)), where the software is loaded from a storage medium (e.g., storage device 110) and operated by the processor 106 in the memory 108 of the data stream processing unit 100. Thus, in one embodiment, the method for adaptive load shedding described in greater detail below can be stored on a computer readable medium or carrier (e.g., RAM, magnetic or optical drive or diskette, and the like).
Alternatively, the method for adaptive load shedding described in greater detail below can be represented as a discrete load shedding module (e.g., a physical device or subsystem that is coupled to the processor 106 through a communication channel) within the data stream processing unit.
FIG. 2 is a flow diagram illustrating one embodiment of a method 200 for adaptive load shedding for data stream processing according to the present invention. The method 200 may be implemented at, for example, a data stream processing unit such as the data stream processing unit 100 illustrated in FIG. 1.
The method 200 is initialized at step 202 and proceeds to step 204, where the method 200 receives at least one input data stream. The input data stream is received, for example, within a sliding window of memory as discussed with reference to FIG. 1. The method 200 then proceeds to step 206 and determines what resources are available to process the input data stream.
In step 208, the method 200 determines, based at least in part on the availability of processing resources, the quantity of data (e.g., how many tuples from within the sliding window) that should be processed. In one embodiment, a determination of how much data should be processed is based at least in part on the rate at which the input data stream is currently being received.
The method 200 then proceeds to step 210 and, based on the amount of data to be processed, selects specific tuples from within the sliding window for processing. Thus, the number of tuples selected for processing will not exceed the total amount of data that was identified for processing in step 208. Tuples not selected in step 210 are then shed (e.g., not processed). In one embodiment, selection of specific tuples for processing is based at least in part on at least one of: the values of the tuples (e.g., tuples most closely related to the purpose motivating the data stream processing operation), the time correlation between two or more tuples, and the join direction of the data stream processing application (e.g., where the method 200 is being implemented to shed load for a data stream join).
In step 212, the method 200 processes the selected tuples in accordance with at least one data stream operation (e.g., joining, filtering, transforming and the like). Received tuples that are not selected for processing are ignored, meaning that the un-selected tuples are not immediately processed, but may be processed at a later point in time, e.g., if the processing resources become available and if the un-selected tuples are still present in a sliding window of memory. The method 200 then terminates in step 214.
The method 200 thereby optimizes a data stream operation by intelligently shedding load. Rather than processing every tuple in a sliding window of memory (and shedding load by arbitrarily discarding tuples before they can even enter the sliding window), the method 200 allows all tuples to enter the sliding window and then processes only selected tuples from within the sliding window based on one or more stated parameters and on resource availability. All data provided to the method 200 therefore remains available for processing, but only a tailored subset of this available data is actually processed. In this way, the method 200 maximizes the quality of the data stream operation output for a given set of available processing resources.
FIG. 3 is a flow diagram illustrating one embodiment of a method 300 for determining the quantity of data (e.g., number of tuples) to be processed, e.g., in accordance with step 208 of the method 200. The method 300 enables the quantity of data that is selected for processing to be adjusted according to the rate at which new data is being received (e.g., the rates at which input data streams are arriving), thereby facilitating efficient use of processing resources.
The method 300 is initialized at step 302 and proceeds to step 304, where the method 300 sets a fraction, r, of the data in each sliding window to be processed to a default value. In one embodiment, the default value is either one or zero, with a default value of one implying an assumption that a stream join operation can be performed fully without any knowledge of data streams yet to be received. In one embodiment, this fraction, r, is applied to all sliding windows containing available tuples for processing.
At substantially the same time that the value for r is set, the method 300 proceeds to step 306 and sets the time, t, to T. The method 300 then proceeds to step 308 and calculates an adaptation factor, β, where β is based on the numbers of tuples fetched from the sliding windows since a last execution of the method 300 and on the arrival rates of the input data streams in the sliding windows since the last execution of the method 300. In one embodiment, β is calculated as:
where α_i (for i = 1, ..., n, where n sliding windows are being processed) is the number of tuples fetched from the i-th sliding window since the last execution of the method 300, λ_i is the average arrival rate of the input data stream in the i-th sliding window since the last execution of the method 300, and T_r is the adaptation period (e.g., such that the method 300 is configured to execute every T_r seconds). In one embodiment, the size of the adaptation period T_r is selected to be adaptive to “bursty” or sporadic data input rates. In one embodiment, T_r is approximately 5 seconds.
Once the adaptation factor β is calculated, the method 300 proceeds to step 310 and determines whether β is less than one. If the method 300 concludes that β is less than one, the method 300 proceeds to step 312 and re-sets r to β*r, which effectively results in smaller fractions of the sliding windows being selected for processing. Alternatively, if the method 300 concludes that β is greater than or equal to one, the method 300 proceeds to step 314 and re-sets r to the smaller of one and δ_r*r, which effectively results in larger fractions of the sliding windows being selected for processing. In this case, δ_r is a fraction boost factor. In one embodiment, the fraction boost factor δ_r is predefined by a user or by the data stream processing unit. In one embodiment, the fraction boost factor δ_r is approximately 1.2. Those skilled in the art will appreciate that selecting higher values for the fraction boost factor δ_r will result in a more aggressive increase of the fraction, r.
Once the value of r has been appropriately re-set, the method 300 proceeds to step 316 and waits for the time t to equal T+T_r. That is, the method 300 waits for the start of the next adaptation period. Once t=T+T_r, the method 300 returns to step 306 and proceeds as described above so that the fractions r of the sliding windows that are selected for processing continually adapt to the arrival rates of the input data streams. In this manner, load shedding can be adapted based on available processing resources even when the arrival rates of input data streams are sporadic or unpredictable.
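The adaptation loop of the method 300 can be sketched as below. The function names are hypothetical. The form of β used here (tuples fetched divided by tuples that arrived during the adaptation period) is an assumption chosen to be consistent with the stated definitions of the fetched counts, the arrival rates and T_r; it is not taken from the specification.

```python
def adaptation_factor(fetched, rates, period):
    """ASSUMED form of beta: total tuples fetched from the sliding windows
    divided by total tuples that arrived during the adaptation period."""
    arrived = sum(rates) * period
    if arrived <= 0:
        return 1.0  # nothing arrived, so there is no reason to shrink r
    return sum(fetched) / arrived


def update_fraction(r, beta, boost=1.2):
    """Steps 310-314: shrink r when beta < 1, otherwise boost it, capped at one."""
    if beta < 1:
        return beta * r
    return min(1.0, boost * r)


# One adaptation period: two windows, 20 tuples/second each, T_r = 5 seconds,
# but only 50 tuples could be fetched from each window.
beta = adaptation_factor(fetched=[50, 50], rates=[20, 20], period=5)
r = update_fraction(1.0, beta)
```

With these example numbers only half of the arriving tuples were fetched, so the fraction r is halved for the next period. A caller would repeat this once every T_r seconds.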
Once the amount of data to be processed is determined, specific tuples may be selected for processing from within each sliding window. One method in which tuples may be selected for processing adapts the selection process according to temporal correlations between tuples by prioritizing tuples according to the times at which the tuples were generated or entered the sliding windows.
FIG. 4 is a schematic diagram illustrating the basis for one embodiment of an adaptive tuple selection method based on time correlation. Consider a data stream processing unit that is configured to receive x input data streams S_1-S_x in x respective sliding windows W_1-W_x. Sliding window W_i, where i ∈ [1, ..., x], is a representative sliding window. Sliding window W_i contains a total of w_i seconds worth of tuples and is divided into n sub-windows B_{i,1}-B_{i,n}, each containing b seconds worth of tuples (i.e., such that n = 1 + [w_i/b]).
Those skilled in the art will appreciate that the size, b, of each sub-window is subject to certain considerations. For example, smaller sub-windows are better for capturing the peak of a match probability distribution, but because there is a larger number of sub-windows, more processing capacity is needed. On the other hand, larger sub-windows are less costly from a processing standpoint, but are less adaptive to the dynamic natures of the input data streams.
New tuples enter the sliding window W_i at a first sub-window B_{i,1} and continue to enter the first sub-window B_{i,1} until the most recent tuple to enter the first sub-window B_{i,1} has a timestamp that is b seconds later than the timestamp of the first tuple to enter the first sub-window B_{i,1}. At this point, all tuples residing in the last sub-window B_{i,n} are discarded, and all sub-windows shift over by one position (i.e., so that the last sub-window B_{i,n} becomes the first sub-window, the first sub-window B_{i,1} becomes the second sub-window, and so on). Thus, the sliding window W_i is maintained as a circular buffer, and tuples do not move from one sub-window to another, but remain in a single sub-window until that sub-window is emptied and shifts to the position of the first sub-window B_{i,1}.
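The circular-buffer behavior described above can be sketched as follows. The class name `SubWindowedWindow` is hypothetical, and the bracket in the expression for n is assumed here to denote a ceiling; both are illustrative choices rather than details confirmed by the specification.

```python
import math
from collections import deque


class SubWindowedWindow:
    """Sliding window of w seconds split into sub-windows of b seconds each."""

    def __init__(self, w, b):
        self.b = b
        # ASSUMPTION: n = 1 + ceil(w / b).
        self.n = 1 + math.ceil(w / b)
        # subs[0] is the first sub-window; subs[-1] is the last (oldest) one.
        self.subs = deque([] for _ in range(self.n))

    def insert(self, timestamp, tup):
        first = self.subs[0]
        first.append((timestamp, tup))
        # Once the newest tuple is b seconds later than the first tuple of the
        # first sub-window, shift: the last sub-window is emptied and becomes
        # the new first sub-window. Tuples never move between sub-windows.
        if timestamp - first[0][0] >= self.b:
            self.subs.rotate(1)
            self.subs[0].clear()
```

Rotating the deque moves the list objects themselves, so each tuple stays in the sub-window it originally entered until that sub-window is emptied.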
As will be discussed in greater detail below, one embodiment of an adaptive tuple selection method based on time correlation establishes a time correlation period, T_c, where the method executes once every T_c seconds to adapt the manner in which specific tuples are selected based on time correlation between incoming tuples. In the case where the tuple selection method is implemented in conjunction with a data stream join operation, one of two tuple processing methods may be performed between two consecutive executions of the tuple selection method. These two tuple processing methods are referred to as full processing and selective processing. Full processing involves comparing a newly input tuple from a first sliding window against all tuples in at least a second sliding window. Selective processing involves comparing a newly input tuple from a first sliding window against tuples contained only in high-priority sub-windows of at least a second sliding window. As will be described in greater detail below, in one embodiment sub-windows are prioritized based on a number of output tuples expected to be produced by comparing the newly input tuple from the first sliding window against tuples in each sub-window of the second sliding window.
Whether a newly input tuple is subjected to full or selective processing is dictated by the tuple's respective sampling probability, γ. The probability of a newly input tuple being subjected to full processing is r*γ; conversely, the probability of the same newly input tuple being subjected to selective processing is 1−r*γ. Thus, the fraction, r, that is determined by the method 300 is used to scale the sampling probability γ so that full processing will not consume all processing resources during periods of heavy loading. In one embodiment, the sampling probability γ is predefined by a user or by the data stream processing unit. The value of the sampling probability γ should be small enough to avoid undermining selective processing, but large enough to be able to capture a match probability distribution. In one embodiment, the sampling probability γ is set to approximately 0.1.
Full processing facilitates the collection of statistics that indicate the “usefulness” of the first sliding window's sub-windows for the data stream join operation. In one embodiment, full processing calculates, for each sub-window B_{i,j}, a number of expected output tuples produced by comparing a newly input tuple with a tuple from the sub-window B_{i,j}. This number of expected output tuples may be referred to as o_{i,j}. Specifically, during full processing, for each match found with a tuple in the sub-window B_{i,j}, a counter ô_{i,j} is incremented. The ô_{i,j} values are later normalized (e.g., by γ*r*b*λ_1* ... *λ_n*T_c) to calculate the number of expected output tuples o_{i,j}.
FIG. 5 is a flow diagram illustrating one embodiment of a method 500 for prioritizing sub-windows of a given sliding window for use in tuple selection, e.g., in accordance with step 210 of the method 200. Specifically, the method 500 enables sub-windows of a sliding window to be sorted based on time delays (e.g., application dependent or communication related) between matching tuples in the sliding window and the tuples against which they are compared, thereby facilitating the selection of the most relevant tuples for processing.
The method 500 is initialized at step 502 and proceeds to step 504, where the method 500 sets the current time, t, to T and sets i to one, where i identifies a sliding window to be examined (e.g., sliding window W_i of FIG. 4). The method 500 then proceeds to step 506 and sorts the sub-windows of the selected sliding window into an array, O. Specifically, the sub-windows are sorted in descending order based on their respective numbers of expected output tuples (un-normalized), ô_{i,j}, such that {ô_{i,j} | j ∈ [1, ..., n]}.
The method 500 then proceeds to step 508 and, for each sub-window, B_{i,j}, (where j ∈ [1, ..., n]), calculates new values for the respective numbers of expected output tuples, o_{i,j}, and s_i^j. In this case, s_i^j = k means that the j-th item in the sorted list {o_{i,l} | l ∈ [1, ..., n]} is o_{i,k}, where the list {o_{i,l} | l ∈ [1, ..., n]} is sorted in descending order. In one embodiment, the new value for o_{i,j} is calculated as:
and s_i^j = k, where O[j] = ô_{i,j}.
In step 510, the method 500 then resets all ô_{i,j} values to zero. The method 500 then proceeds to step 512 and sets i to i+1, e.g., so that the method 500 focuses on the next sliding window to be examined. Thus, the method 500 inquires, at step 514, if i is now less than three. If the method 500 determines that i is less than three, the method 500 returns to step 506 and proceeds as described above, e.g., in order to analyze the sub-windows of the next sliding window to be examined.
Alternatively, if the method 500 determines in step 514 that i is not less than three, the method 500 proceeds to step 516 and waits until the time, t, is T+T_c. That is, the method 500 waits for the start of the next time correlation adaptation period. Once the next time correlation adaptation period starts, the method 500 returns to step 504 and proceeds as described above so that the o_{i,j} and s_i^j values of each sub-window under examination continually adapt to the temporal correlations between the incoming data streams.
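One pass of the prioritization described above, for a single sliding window, might look like the sketch below. The function name `prioritize` is hypothetical and indices are 0-based. The normalization uses the example factor γ*r*b*λ_1*...*λ_n*T_c given earlier in the text; treating that factor as the divisor for the expected output counts is an assumption.

```python
def prioritize(raw_counts, gamma, r, b, rates, t_c):
    """Sort sub-windows by their un-normalized match counters.

    raw_counts[j] is the counter for sub-window j and is reset in place.
    Returns (order, expected): order[j] is the index of the sub-window with
    the j-th highest count, and expected[k] is the normalized count of
    sub-window k.
    """
    # ASSUMED normalization factor, taken from the "e.g." in the description.
    norm = gamma * r * b * t_c
    for lam in rates:
        norm *= lam

    # Step 506: descending order of the un-normalized counters.
    order = sorted(range(len(raw_counts)), key=lambda j: raw_counts[j], reverse=True)
    # Step 508: normalized expected output counts.
    expected = [c / norm if norm > 0 else 0.0 for c in raw_counts]
    # Step 510: reset the counters for the next adaptation period.
    raw_counts[:] = [0] * len(raw_counts)
    return order, expected


counts = [2, 9, 0, 5]
order, expected = prioritize(counts, gamma=0.1, r=1.0, b=2, rates=[5, 5], t_c=1)
```

In this example, sub-window 1 recorded the most matches and is ranked first, followed by sub-windows 3, 0 and 2. A caller would run this once per sliding window every T_c seconds.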
FIG. 6 is a flow diagram illustrating one embodiment of a method 600 for selecting tuples for processing, e.g., in accordance with step 212 of the method 200. Specifically, the method 600 processes a given tuple, y, from a first sliding window W_1 against one or more selected tuples in a second sliding window W_2. As will be described in further detail below, the method 600 exploits knowledge gained from the method 500 regarding the prioritizing of sub-windows within the second sliding window W_2.
The method 600 is initialized at step 602 and proceeds to step 604, where the method 600 identifies the tuple, y, for processing from the first sliding window W_1. In one embodiment, the identified tuple, y, is a newly received tuple. The method 600 also identifies the second window W_2 against which to process the identified tuple y, e.g., in accordance with a data stream join operation.
In step 606, the method 600 obtains or generates a random number R. The method 600 then proceeds to step 608 and inquires if R is less than r*γ. If the method 600 determines that R is less than r*γ, the method 600 proceeds to step 612 and commences full processing on the tuple y from the first window W_1.
Specifically, in step 612, the method 600 processes the tuple y from the first sliding window W_1 against all tuples in the second sliding window W_2 in accordance with at least one data stream operation (e.g., a join). The method 600 then proceeds to step 614 and, for each matched output in each sub-window of the second sliding window W_2, increments the sub-window's un-normalized output count ô_{i,j} (e.g., by one). The method 600 then terminates in step 628.
Alternatively, if the method 600 determines in step 608 that R is not less than r*γ, the method 600 proceeds to step 610 and commences selective processing on the tuple y from the first sliding window W_1. Specifically, in step 610, the method 600 determines the number of tuples to be processed from the second sliding window W_2. In one embodiment, the number of tuples to be processed, a, is calculated as:
a = r*|W_1| (EQN. 3)
where |W_1| is the size of the first sliding window W_1 (e.g., as measured in terms of a number of tuples or a duration of time contained within the first sliding window W_1).
The method 600 then proceeds to step 616 and starts to process the tuple y from the first window W_1 against tuples in the second sliding window W_2, starting with the highest priority sub-window in the second sliding window W_2 (e.g., B_{i,s_i^j}) and working through the remaining sub-windows in descending order of priority until the tuple y from the first sliding window W_1 has been processed against a total of a tuples from the second sliding window W_2. Specifically, in step 616, the method 600 inquires whether any sub-windows remain for processing in the second sliding window W_2 (e.g., whether the current sub-window is the last sub-window). If the method 600 concludes that no sub-windows remain for processing in the second sliding window W_2, the method 600 terminates in step 628.
Alternatively, if the method 600 concludes in step 616 that there are sub-windows that are available for processing in the second sliding window W_2, the method 600 proceeds to step 618 and adjusts the number of tuples available for processing in the second sliding window W_2 to account for the tuples contained in the first available sub-window (e.g., the highest-priority available sub-window, B_{i,s_i^j}). That is, the method 600 subtracts the number of tuples in the first available sub-window from the total number of tuples, a, to be selected for processing from the second sliding window W_2 (e.g., such that the new value for a = a−|B_{i,s_i^j}|). Thus, a−|B_{i,s_i^j}| more tuples from the second sliding window W_2 may still be processed against the tuple y from the first sliding window W_1.
The method 600 then proceeds to step 620 and inquires whether any more tuples from the second sliding window W_2 are available for processing (e.g., whether the adjusted a>0). If the method 600 concludes that a number of tuples in the second sliding window W_2 can still be processed, the method 600 proceeds to step 624 and processes the tuple y from the first sliding window W_1 against all tuples in the first available sub-window B_{i,s_i^j} of the second sliding window W_2. The method 600 then proceeds to step 626 and proceeds to the next available (e.g., next-highest priority) sub-window in the second sliding window W_2. The method 600 then returns to step 616 and proceeds as described above in order to determine how many tuples from the next available sub-window can be used for processing.
Alternatively, if the method 600 concludes in step 620 that no more tuples can be processed from the second sliding window W_2 (e.g., that the adjusted a is <0), the method 600 proceeds to step 622 and processes the tuple y from the first sliding window W_1 against a fraction of the tuples contained within the first available sub-window B_{i,s_i^j}. In one embodiment, this fraction, r_c, is calculated as:
where r_c is a fraction with a value in the range of zero to one. Once the tuple y from the first sliding window W_1 has been processed against the fraction r_c of the first available sub-window B_{i,s_i^j}, the method 600 terminates in step 628.
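The full and selective branches of the method 600 can be sketched together as below. The function name `process_tuple`, its parameters and the 0-based indexing are hypothetical. Two details are assumptions: the partially processed sub-window keeps |B| + a tuples (equivalently, a fraction 1 + a/|B|, using the adjusted, non-positive a), and those tuples are taken from the front of the sub-window. Neither is confirmed by the text above.

```python
def process_tuple(y, sub_windows, order, a, r, gamma, R, matches, counts):
    """Process tuple y against a second window that is split into sub-windows.

    sub_windows: list of lists of tuples (index j is sub-window j)
    order:       sub-window indices, highest priority first
    a:           budget of tuples to compare against in selective processing
    R:           random number in [0, 1) drawn by the caller (step 606)
    counts:      un-normalized per-sub-window match counters, updated in place
    """
    output = []
    if R < r * gamma:
        # Full processing (steps 612-614): compare against every tuple and
        # record which sub-window each match came from.
        for j, sub in enumerate(sub_windows):
            for t in sub:
                if matches(y, t):
                    output.append((y, t))
                    counts[j] += 1
        return output

    # Selective processing (steps 610-626): walk sub-windows by priority.
    for j in order:
        sub = sub_windows[j]
        a -= len(sub)  # step 618: charge this sub-window against the budget
        if a > 0:
            candidates = sub  # step 624: the whole sub-window fits
        else:
            # Step 622, ASSUMED fraction: only len(sub) + a tuples still fit.
            candidates = sub[:max(0, len(sub) + a)]
        output.extend((y, t) for t in candidates if matches(y, t))
        if a <= 0:
            break  # budget exhausted
    return output
```

Passing R in from the caller keeps the sketch deterministic; in practice R would be drawn from a uniform random number generator for each newly input tuple.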
In yet another embodiment, once the amount of data to be processed is determined, specific tuples may be selected for processing from within each sliding window based on the join direction of a data stream join operation. The “direction” of a data stream join operation is defined by the numbers of tuples that are processed from each input data stream (e.g., if more tuples are being processed from a first data stream S_1 than a second data stream, S_2, the join is in the direction of the second data stream S_2). Because of the time delay difference between data streams, one direction of a data stream join operation may be more valuable than the opposite direction. For example, comparing a single tuple from the first data stream S_1 against many tuples from a second data stream S_2 may produce more usable output than the converse operation. Thus, in this case, load shedding should be performed in the converse direction (e.g., more tuples should be shed from the first sliding window W_1 than the second sliding window W_2).
FIG. 7 is a flow diagram illustrating one embodiment of a method 700 for selecting tuples for processing, e.g., in accordance with step 212 of the method 200. Specifically, the method 700 determines the individual fractions, r_1 and r_2, that should be applied, respectively, to process a fraction of the tuples in first and second sliding windows W_1 and W_2. This optimizes the direction of the join operation in order to maximize output.
The method 700 is initialized at step 702 and proceeds to step 704, where the method 700 sets the fraction r_1 of the first sliding window W_1 to be processed to one. The method 700 also sets the fraction r_2 of the second sliding window W_2 to be processed to one. The method 700 then proceeds to step 706 and computes a generic r value, e.g., in accordance with the method 300.
In step 708, the method 700 computes the expected numbers of output tuples o_1 and o_2 to be produced, respectively, by the first and second sliding windows W_1 and W_2. In one embodiment, the values for o_1 and o_2 are calculated as:
where i indicates the specific sliding window W_1 or W_2 for which the expected number of output tuples is being calculated (e.g., i being 1 or 2 in this example), n_i is the total number of sub-windows in the sliding window (e.g., W_1 or W_2) under consideration, and j indicates any sub-window 1-n within the sliding window W_1 or W_2 under consideration.
Once the expected numbers of output tuples o_1 and o_2 are calculated for each sliding window W_1 and W_2, the method 700 proceeds to step 710 and inquires if o_1 ≥ o_2. If the method 700 determines that o_1 is greater than or equal to o_2, the method 700 proceeds to step 712 and re-sets r_1 to the smaller of one and
Alternatively, if the method 700 determines in step 710 that o_1 is not greater than or equal to o_2, the method 700 proceeds to step 714 and re-sets r_1 to the larger of zero and
Instep716, once the new value for r1has been computed, themethod700 calculates a new value for r2. In one embodiment, r2is calculated as:
The method 700 then terminates in step 718.
As described herein, the methods 500, 600 and 700 are aimed at maximizing the number of output tuples, oi,j, generated by a data stream processing operation given limited processing resources. However, for some data stream processing operations, it may be desirable to maximize not just the quantity, but the value of the output data. Thus, in one embodiment, each tuple received via an input data stream is associated with an importance value, which is defined by the type of tuple and specified by a utility value attached to that type of tuple.
In one embodiment, the type of a tuple, y, is defined as Z(y) = z ∈ Z. The utility value of the same tuple, y, is thus defined as V(Z(y)) = V(z). In one embodiment, type and utility value parameters are set based on application needs. For example, in news matching applications (e.g., where tuples representing news items from two or more different sources are matched), tuples representing news items can be assigned utility values from the domain [1, ..., 10], where a value of 10 is assigned to the tuples representing the most important news items. Moreover, the frequency of appearance of a tuple of type z in an input data stream Si is denoted as fi,z.
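The type, utility and frequency definitions above can be illustrated with a minimal Python sketch. The type names, the `UTILITY` table, the dictionary-shaped tuples and the choice to express fi,z as a relative frequency are all assumptions made for illustration; the specification does not prescribe them.

```python
from collections import Counter

# Hypothetical utility table V(z): tuple type -> utility value in [1, ..., 10].
UTILITY = {"breaking": 10, "business": 6, "weather": 2}

def tuple_type(y):
    """Z(y): the type of tuple y, read here from an assumed 'type' field."""
    return y["type"]

def utility(y):
    """V(Z(y)): the utility value attached to the type of tuple y."""
    return UTILITY[tuple_type(y)]

def type_frequencies(stream):
    """f_{i,z}: relative frequency of each type z in stream S_i (assumed relative)."""
    counts = Counter(tuple_type(y) for y in stream)
    total = sum(counts.values())
    return {z: c / total for z, c in counts.items()}

# A toy input stream S1 containing four news tuples.
s1 = [{"type": "breaking"}, {"type": "weather"},
      {"type": "weather"}, {"type": "business"}]
freqs = type_frequencies(s1)
```

In this toy stream, half of the tuples are of the low-utility "weather" type, which is the kind of type a utility-aware load shedder would be expected to shed first.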
Thus, in one embodiment, load shedding may be performed in a manner that sheds a proportionally smaller number of tuples of types that provide higher output utilities. The extra processing resources that are allocated to process these high-output utility tuple types are balanced by shedding a proportionally larger number of tuples of types having low output utilities. This can be accomplished by applying different processing fractions, ri,z, to different types of tuples, based on the output utilities of those types of tuples. In one embodiment, the expected output utility obtained from comparing a tuple y of type z from a first sliding window W1 with a tuple in a second sliding window W2 is denoted as ui,z and is used to determine ri,z values.
The computation of ri,z can be formulated as a fractional knapsack problem having a greedy optimal solution. For example, consider Ii,j,z as an item that represents the processing of a tuple y of type z (from the first sliding window W1) against sub-window Bi,j of the second sliding window W2. Item Ii,j,z has a volume of λ1*λ2*wi,z*b units and a value of λ1*λ2*wi,z*ui,z*b*pi,sij units, where pi,j denotes the probability of a match for sub-window Bi,j. Thus, the goal is to select a maximum number of items, where fractional items are acceptable, so that the total value is maximized and the total volume of the selected items is at most λ1*λ2*r*(w1+w2). Here, ri,j,z ∈ [0, 1] is used to denote how much of item Ii,j,z is selected. Those skilled in the art will appreciate that the number of unknown variables (e.g., the ri,j,z values) can be calculated as (B1,n+B2,n)*|Z|, and the solution of the original problem (e.g., determining a value for ri,z) can be calculated from these variables as:
In one embodiment, the values of the fraction variables (e.g., the ri,j,z values) are determined during a join direction adaptation (e.g., as described in the method 700). In one embodiment, a simple way to do this is to sort the items Ii,j,z based on their respective value over volume ratios, ui,z * pi,sij, and to select as much as possible of the item Ii,j,z that is most valuable per unit volume. However, since the total number of items Ii,j,z may be large, this sorting can be costly in terms of processing resources, especially for a large number of sub-windows and larger sized tuple types.
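The sort-based selection described above is the standard greedy solution to a fractional knapsack problem, and can be sketched as follows. This is an illustrative sketch only: the function name, the `(key, volume, value)` item layout and the sample numbers are assumptions, and in the setting above each key would stand for an (i, j, z) triple with a volume of λ1*λ2*wi,z*b and a strictly positive value.

```python
def fractional_knapsack(items, capacity):
    """Greedy fractional knapsack.

    items: list of (key, volume, value) with volume > 0 (assumed layout).
    capacity: total volume that may be selected.
    Returns {key: fraction of that item selected, in [0, 1]}.
    """
    fractions = {key: 0.0 for key, _, _ in items}
    remaining = capacity
    # Visit items in decreasing order of value per unit volume.
    by_ratio = sorted(items, key=lambda it: it[2] / it[1], reverse=True)
    for key, volume, value in by_ratio:
        if remaining <= 0:
            break
        # Take the whole item if it fits, otherwise the fraction that fits.
        take = min(1.0, remaining / volume)
        fractions[key] = take
        remaining -= take * volume
    return fractions

# Toy data: ratios are 6, 5 and 4, and the capacity of 50 covers
# items "a" and "b" entirely plus two thirds of item "c".
items = [("a", 10, 60), ("b", 20, 100), ("c", 30, 120)]
fractions = fractional_knapsack(items, capacity=50)
```

The full sort costs O(N log N) in the number of items, which is the expense that motivates avoiding a sort over all items when there are many sub-windows and tuple types.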
Thus, in another embodiment, use is made of the sij values that define an order between value over volume ratios of items Ii,j,z for a fixed type z and sliding window Wi. Items Ii,j,z representing different data streams S and different types z with the highest value over volume ratios are maintained in a heap H. An item Ii,j,z is iteratively selected from the heap H and replaced with an item Ii,j,z having the next highest value over volume ratio with the same data stream and same type subscript index. This iterative process repeats until a capacity constraint is reached.
FIG. 8 is a flow diagram illustrating one embodiment of a method 800 for selecting tuples for processing, e.g., in accordance with step 212 of the method 200. Specifically, the method 800 selects tuples for processing based not only on an optimal join direction, but also on the respective values of the tuples as discussed above.
The method 800 is initialized at step 802 and proceeds to step 804, where the method 800 calculates a fraction, r, of the tuples to be processed (e.g., in accordance with the method 300) and also establishes a heap, H.
The method 800 then proceeds to step 806 and sets an initial value of ri,z to zero and an initial value of νi,si1,z to ui,z * pi,si1, where νi,si1,z is the value over volume ratio of the item Ii,si1,z. In step 808, the method 800 initializes the heap, H, with {νi,si1,z | i ∈ [1, ..., 2], z ∈ Z} and sets the total number of tuples to be processed, a, to a = λ1*λ2*r*(w1+w2).
Once the heap, H, has been initialized and the number of tuples to be processed, a, set, the method 800 proceeds to step 810 and inquires if the heap, H, is empty. If the method 800 concludes that the heap, H, is empty, the method 800 terminates in step 824.
Alternatively, if the method 800 determines in step 810 that the heap, H, is not empty, the method 800 proceeds to step 812 and selects the first (e.g., topmost) item from the heap, H. Moreover, based on the selection of the first item, the method 800 adjusts the total number of tuples, a, that can still be processed. In one embodiment, the total number of tuples, a, is now a − (wi,z*λ1*λ2*b).
The method 800 then proceeds to step 814 and inquires if the adjusted value of a is still greater than zero. If the method 800 concludes that a is not greater than zero (e.g., no more tuples can be processed after subtracting the first item from the heap, H), the method 800 proceeds to step 816 and adjusts the fraction rc of the first available sub-window to be processed such that:
Moreover, the method 800 re-sets ri,z to
The method 800 then terminates in step 824.
Alternatively, if the method 800 determines in step 814 that a is greater than zero (e.g., tuples remain available for processing after subtracting the first item from the heap, H), the method 800 proceeds to step 818 and re-sets ri,z to
The method 800 then proceeds to step 820 and determines whether the current sub-window, j, from which the last processed item was taken is the last sub-window, n, in the sliding window under examination (e.g., whether j<n). If the current sub-window j is not the last sub-window, n (e.g., if j<n), then the method 800 proceeds to step 822 and sets νi,sij+1,z = ui,z * pi,sij+1 and inserts νi,sij+1,z into the heap, H. The method 800 then returns to step 810 and proceeds as described above. Alternatively, if the method 800 determines in step 820 that the current sub-window, j, is the last sub-window, n (e.g., j=n), the method 800 bypasses step 822 and returns directly to step 810.
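The heap-driven loop of steps 806 through 822 can be sketched in Python as follows. This is a hedged illustration rather than the claimed method: the update expressions for ri,z are not reproduced in the text above, so the sketch assumes that each fully selected item adds 1/ni to ri,z and that the final, partially selected item adds its fitting fraction divided by ni. The function name, argument layout and sample numbers are likewise assumptions. Because Python's `heapq` is a min-heap, ratios are pushed negated so that the highest value over volume ratio is popped first.

```python
import heapq

def select_fractions(u, p_sorted, w_type, lam1, lam2, b, r, w1, w2):
    """Sketch of a heap-based selection of per-type fractions r_{i,z}.

    u[i][z]      : expected output utility u_{i,z}
    p_sorted[i]  : match probabilities p_{i,s_i^j}, j = 1..n_i, in decreasing order
    w_type[i][z] : the quantity w_{i,z} used in the item volume (must be > 0)
    Returns {(i, z): r_{i,z}}.
    """
    r_frac = {(i, z): 0.0 for i in u for z in u[i]}        # step 806
    budget = lam1 * lam2 * r * (w1 + w2)                   # step 808: a
    heap = []
    for i in u:
        for z in u[i]:
            if p_sorted[i]:
                # Entry: (negated ratio, window i, type z, 0-based sub-window rank j)
                heapq.heappush(heap, (-u[i][z] * p_sorted[i][0], i, z, 0))
    while heap:                                            # step 810
        _, i, z, j = heapq.heappop(heap)                   # step 812
        n_i = len(p_sorted[i])
        volume = w_type[i][z] * lam1 * lam2 * b
        budget -= volume
        if budget <= 0:                                    # steps 814 and 816
            partial = 1.0 + budget / volume                # assumed: fraction that still fits
            r_frac[(i, z)] += partial / n_i                # assumed update rule
            break
        r_frac[(i, z)] += 1.0 / n_i                        # step 818, assumed update rule
        if j + 1 < n_i:                                    # steps 820 and 822
            heapq.heappush(heap, (-u[i][z] * p_sorted[i][j + 1], i, z, j + 1))
    return r_frac

# Toy data: two windows, two sub-windows each, two tuple types.
u = {1: {"hi": 10, "lo": 1}, 2: {"hi": 10, "lo": 1}}
p_sorted = {1: [0.5, 0.1], 2: [0.4, 0.2]}
w_type = {1: {"hi": 1, "lo": 1}, 2: {"hi": 1, "lo": 1}}
result = select_fractions(u, p_sorted, w_type,
                          lam1=1, lam2=1, b=1, r=0.625, w1=2, w2=2)
```

With these toy inputs the budget of 2.5 units is spent entirely on the high-utility type, and the low-utility type is shed completely. Only the current best item per (i, z) pair sits in the heap at any time, so the heap holds at most 2*|Z| entries instead of one entry per item.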
Thus, the present invention represents a significant advancement in the field of data stream processing. The present invention allows all incoming data streams to be received in memory, but selects only a subset of the tuples contained within the received data streams for processing, based on available processing resources and on one or more characteristics of the subset of tuples. The invention thus makes it possible for load shedding to be performed in an “intelligent” (e.g., non-arbitrary) manner, thereby maximizing the quality of the data stream operation output.
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.