Aggregate function

From Wikipedia, the free encyclopedia

(Redirected from Decomposable aggregation function)

Type of function in database management

In database management, an aggregate function or aggregation function is a function where multiple values are processed together to form a single summary statistic.

(Figure 1) Entity relationship diagram representation of aggregation.

Common aggregate functions include:

Average (arithmetic mean)
Count
Maximum
Median
Minimum
Mode
Range
Sum

Others include:

Nanmean (mean ignoring NaN values, also known as "nil" or "null")
Stddev

Formally, an aggregate function takes as input a set, a multiset (bag), or a list from some input domain I and outputs an element of an output domain O.^[1] The input and output domains may be the same, such as for SUM, or may be different, such as for COUNT.

Aggregate functions occur commonly in numerous programming languages, in spreadsheets, and in relational algebra.

The listagg function, as defined in the SQL:2016 standard,^[2] aggregates data from multiple rows into a single concatenated string.
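The behaviour described above can be imitated outside SQL. The following is a minimal Python sketch, not the SQL function itself: the helper name `listagg`, its parameters, and the sample rows are assumptions made for illustration, and the sketch keeps input order within each group rather than modelling any ordering clause.

```python
from collections import defaultdict

def listagg(rows, key, value, separator=", "):
    """Concatenate `value` per `key` group, keeping input order within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(str(row[value]))
    return {k: separator.join(v) for k, v in groups.items()}

# Hypothetical rows: several rows per department collapse to one string each.
rows = [
    {"dept": "ops", "name": "Ann"},
    {"dept": "dev", "name": "Bo"},
    {"dept": "ops", "name": "Cy"},
]
result = listagg(rows, key="dept", value="name")
```

Here the input domain is a list of strings and the output domain is a single string per group.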

In the entity relationship diagram, aggregation is represented as seen in Figure 1 with a rectangle around the relationship and its entities to indicate that it is being treated as an aggregate entity.^[3]

Decomposable aggregate functions

Aggregate functions present a bottleneck, because they potentially require having all input values at once. In distributed computing, it is desirable to divide such computations into smaller pieces, and distribute the work, usually computing in parallel, via a divide and conquer algorithm.

Some aggregate functions can be computed by computing the aggregate for subsets, and then aggregating these aggregates; examples include COUNT, MAX, MIN, and SUM. In other cases the aggregate can be computed by computing auxiliary numbers for subsets, aggregating these auxiliary numbers, and finally computing the overall number at the end; examples include AVERAGE (tracking sum and count, dividing at the end) and RANGE (tracking max and min, subtracting at the end). In other cases the aggregate cannot be computed without analyzing the entire set at once, though in some cases approximations can be distributed; examples include DISTINCT COUNT (count-distinct problem), MEDIAN, and MODE.
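The three cases can be illustrated with a short Python sketch. The helper `chunks` and the sample data are assumptions for the example; the subsets stand in for pieces of work handed to separate workers.

```python
from statistics import median

def chunks(values, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

data = [1, 2, 3, 4, 100, 200, 300]
parts = chunks(data, 4)  # two subsets of sizes 4 and 3

# Case 1: aggregate each subset, then aggregate the aggregates.
total = sum(sum(p) for p in parts)
count = sum(len(p) for p in parts)
largest = max(max(p) for p in parts)
smallest = min(min(p) for p in parts)

# Case 2: AVERAGE and RANGE need auxiliary values and one final step.
average = total / count
value_range = largest - smallest

# Case 3: MEDIAN does not decompose this way. The median of the
# per-subset medians generally differs from the median of the whole input.
median_of_medians = median(median(p) for p in parts)
true_median = median(data)
```

With this data the per-subset medians are far from the overall median, so combining them cannot recover the correct answer.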

Functions that can be computed in one of the first two ways are called decomposable aggregation functions^[4] or decomposable aggregate functions. The simplest may be referred to as self-decomposable aggregation functions, which are defined as those functions $f$ such that there is a merge operator $\diamond$ such that

$f(X\uplus Y)=f(X)\diamond f(Y)$

where $\uplus$ is the union of multisets (see monoid homomorphism).

For example, SUM:

$\operatorname{SUM}(\{x\})=x$, for a singleton;

$\operatorname{SUM}(X\uplus Y)=\operatorname{SUM}(X)+\operatorname{SUM}(Y)$, meaning that the merge operator $\diamond$ is simply addition.

COUNT:

$\operatorname{COUNT}(\{x\})=1$,

$\operatorname{COUNT}(X\uplus Y)=\operatorname{COUNT}(X)+\operatorname{COUNT}(Y)$.

MAX:

$\operatorname{MAX}(\{x\})=x$,

$\operatorname{MAX}(X\uplus Y)=\max{\bigl(}\operatorname{MAX}(X),\operatorname{MAX}(Y){\bigr)}$.

MIN:

$\operatorname{MIN}(\{x\})=x$,^[2]

$\operatorname{MIN}(X\uplus Y)=\min{\bigl(}\operatorname{MIN}(X),\operatorname{MIN}(Y){\bigr)}$.

Note that self-decomposable aggregation functions can be combined (formally, taking the product) by applying them separately, so for instance one can compute both the SUM and COUNT at the same time, by tracking two numbers.
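One way to sketch this product construction in Python is to represent each self-decomposable function as a pair (value on a singleton, merge operator). The names `product` and `aggregate` and this pair representation are assumptions for illustration, not an established API.

```python
from functools import reduce

# Each self-decomposable aggregate is a pair (singleton, merge).
SUM = (lambda x: x, lambda a, b: a + b)
COUNT = (lambda x: 1, lambda a, b: a + b)
MAX = (lambda x: x, max)

def product(*aggregates):
    """Combine aggregates so that one pass tracks a tuple of partial results."""
    def singleton(x):
        return tuple(s(x) for s, _ in aggregates)

    def merge(a, b):
        # Merge componentwise, each component with its own merge operator.
        return tuple(m(u, v) for (_, m), u, v in zip(aggregates, a, b))

    return singleton, merge

def aggregate(agg, values):
    """Fold a non-empty iterable using the aggregate's singleton and merge."""
    singleton, merge = agg
    return reduce(merge, map(singleton, values))

sum_and_count = aggregate(product(SUM, COUNT), [3, 5, 10])
```

The result is the pair of sum and count computed in a single pass; the product of self-decomposable functions is again self-decomposable because its merge acts independently on each component.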

More generally, one can define a decomposable aggregation function $f$ as one that can be expressed as the composition of a final function $g$ and a self-decomposable aggregation function $h$, $f=g\circ h,f(X)=g(h(X))$. For example, AVERAGE = SUM/COUNT and RANGE = MAX − MIN.
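As a worked instance of this composition for AVERAGE, the derivation below uses two small multisets chosen here purely for illustration, with $h=(\operatorname{SUM},\operatorname{COUNT})$ merged componentwise and $g(s,n)=s/n$:

```latex
\begin{aligned}
X &= \{1,2,3\}, \qquad Y = \{4,5\} \\
h(X) &= (6,3), \qquad h(Y) = (9,2) \\
h(X\uplus Y) &= h(X)\diamond h(Y) = (6+9,\ 3+2) = (15,5) \\
f(X\uplus Y) &= g(15,5) = 15/5 = 3
\end{aligned}
```

Averaging the two partial averages directly would give $(2+4.5)/2=3.25$, which is why the auxiliary pair is merged first and $g$ is applied only once at the end.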

In the MapReduce framework, these steps are known as InitialReduce (value on individual record/singleton set), Combine (binary merge on two aggregations), and FinalReduce (final function on auxiliary values),^[5] and moving decomposable aggregation before the Shuffle phase is known as an InitialReduce step.^[6]
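The three steps can be sketched for RANGE in plain Python. The function names mirror the step names above but are hypothetical; they are not the interface of any particular MapReduce implementation, and the two "workers" are just local lists.

```python
from functools import reduce

def initial_reduce(record):
    """Map one record to its auxiliary value: (min, max) of a singleton."""
    return (record, record)

def combine(a, b):
    """Binary merge of two partial (min, max) aggregations."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def final_reduce(partial):
    """Final function applied once to the merged auxiliary value."""
    lo, hi = partial
    return hi - lo

def partial_aggregate(records):
    """Run InitialReduce and Combine locally, as a worker might before the shuffle."""
    return reduce(combine, map(initial_reduce, records))

# Two hypothetical workers each hold part of the input.
worker_a = partial_aggregate([7, 2, 9])
worker_b = partial_aggregate([4, 11])
value_range = final_reduce(combine(worker_a, worker_b))
```

Each worker ships only a two-number partial result instead of all of its records, which is the benefit of aggregating before the shuffle.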

Decomposable aggregation functions are important in online analytical processing (OLAP), as they allow aggregation queries to be computed on the pre-computed results in the OLAP cube, rather than on the base data.^[7] For example, it is easy to support COUNT, MAX, MIN, and SUM in OLAP, since these can be computed for each cell of the OLAP cube and then summarized ("rolled up"), but it is difficult to support MEDIAN, as that must be computed for every view separately.
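A roll-up over pre-computed cells might look like the following Python sketch. The sales records, the dimension names, and the helper `roll_up_by_region` are made up for the example; a real OLAP engine stores and indexes cells very differently.

```python
# Hypothetical base data: (region, month, amount).
sales = [
    ("east", "jan", 10), ("east", "jan", 30),
    ("east", "feb", 20), ("west", "jan", 5),
    ("west", "feb", 15), ("west", "feb", 25),
]

# Pre-compute one cell per (region, month) holding (SUM, COUNT, MAX).
cells = {}
for region, month, amount in sales:
    s, c, m = cells.get((region, month), (0, 0, amount))
    cells[(region, month)] = (s + amount, c + 1, max(m, amount))

def roll_up_by_region(cells):
    """Summarize cells over months using only the pre-computed values."""
    out = {}
    for (region, _month), (s, c, m) in cells.items():
        ps, pc, pm = out.get(region, (0, 0, m))
        out[region] = (ps + s, pc + c, max(pm, m))
    return out

by_region = roll_up_by_region(cells)
```

The roll-up never touches `sales` again. A per-region MEDIAN could not be answered from these cells and would require going back to the base data.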

Other decomposable aggregate functions

In order to calculate the average and standard deviation from aggregate data, it is necessary to have available for each group: the total of the values (Σx_i = SUM(x)), the number of values (N = COUNT(x)), and the total of the squares of the values (Σx_i² = SUM(x²)).^[8]

AVG:
$\operatorname{AVG}(X\uplus Y)={\bigl(}\operatorname{AVG}(X)\cdot\operatorname{COUNT}(X)+\operatorname{AVG}(Y)\cdot\operatorname{COUNT}(Y){\bigr)}/{\bigl(}\operatorname{COUNT}(X)+\operatorname{COUNT}(Y){\bigr)}$
or
$\operatorname{AVG}(X\uplus Y)={\bigl(}\operatorname{SUM}(X)+\operatorname{SUM}(Y){\bigr)}/{\bigl(}\operatorname{COUNT}(X)+\operatorname{COUNT}(Y){\bigr)}$
or, only if COUNT(X) = COUNT(Y),
$\operatorname{AVG}(X\uplus Y)={\bigl(}\operatorname{AVG}(X)+\operatorname{AVG}(Y){\bigr)}/2$
SUM(x²): The sum of the squares of the values is needed in order to calculate the standard deviation of groups.
$\operatorname{SUM}(X^{2}\uplus Y^{2})=\operatorname{SUM}(X^{2})+\operatorname{SUM}(Y^{2})$
STDDEV:
For a finite population with equal probabilities at all points, we have^[9]
$\operatorname{STDDEV}(X)=s(x)={\sqrt{{\frac{1}{N}}\sum_{i=1}^{N}(x_{i}-{\overline{x}})^{2}}}={\sqrt{{\frac{1}{N}}\left(\sum_{i=1}^{N}x_{i}^{2}\right)-({\overline{x}})^{2}}}={\sqrt{\operatorname{SUM}(x^{2})/\operatorname{COUNT}(x)-\operatorname{AVG}(x)^{2}}}$

This means that the standard deviation is equal to the square root of the difference between the average of the squares of the values and the square of the average value. For the union of two groups:
$\operatorname{STDDEV}(X\uplus Y)={\sqrt{\operatorname{SUM}(X^{2}\uplus Y^{2})/\operatorname{COUNT}(X\uplus Y)-\operatorname{AVG}(X\uplus Y)^{2}}}$
$\operatorname{STDDEV}(X\uplus Y)={\sqrt{{\bigl(}\operatorname{SUM}(X^{2})+\operatorname{SUM}(Y^{2}){\bigr)}/{\bigl(}\operatorname{COUNT}(X)+\operatorname{COUNT}(Y){\bigr)}-{\bigl(}(\operatorname{SUM}(X)+\operatorname{SUM}(Y))/(\operatorname{COUNT}(X)+\operatorname{COUNT}(Y)){\bigr)}^{2}}}$
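The formulas above can be checked with a small Python sketch that keeps the triple (COUNT, SUM, SUM of squares) per group. The helper names `summarize`, `merge`, `avg`, and `stddev` and the sample values are assumptions for the example; the result is compared against the standard library's population standard deviation.

```python
import math
from statistics import pstdev

def summarize(values):
    """Auxiliary values for one group: (COUNT, SUM, SUM of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    """All three auxiliary values are self-decomposable under addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def avg(summary):
    """AVG = SUM / COUNT."""
    n, s, _ = summary
    return s / n

def stddev(summary):
    """Population standard deviation: sqrt(SUM(x^2)/COUNT - AVG^2)."""
    n, s, sq = summary
    # max() guards against a tiny negative value from floating-point rounding.
    return math.sqrt(max(sq / n - (s / n) ** 2, 0.0))

x = [2, 4, 4, 4]
y = [5, 5, 7, 9]
combined = merge(summarize(x), summarize(y))
```

Only three numbers per group are exchanged, regardless of group size. One caveat of this formulation is that subtracting two large, nearly equal quantities can lose floating-point precision when the mean is large relative to the spread.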

References

^Jesus, Baquero & Almeida 2011, 2 Problem Definition, pp. 3.
^Winand, Markus (2017-05-15). "Big News in Databases: New SQL Standard, Cloud Wars, and ACIDRain (Spring 2017)". DZone. Archived from the original on 2017-05-27. Retrieved 2017-06-10. "In December 2016, ISO released a new version of the SQL standard. It introduces new features such as row pattern matching, listagg, date and time formatting, and JSON support."
^Elmasri, Ramez (2016). Fundamentals of database systems. Sham Navathe (Seventh ed.). Hoboken, NJ. p. 133. ISBN 978-0-13-397077-7. OCLC 913842106.
^Jesus, Baquero & Almeida 2011, 2.1 Decomposable functions, pp. 3–4.
^Yu, Gunda & Isard 2009, 2. Distributed Aggregation, pp. 2–4.
^Yu, Gunda & Isard 2009, 2. Distributed Aggregation, p. 1.
^Zhang 2017, p. 1.
^Ing. Óscar Bonilla, MBA
^Standard deviation#Identities and mathematical properties