Sketches

GoogleSQL for BigQuery supports data sketches.A data sketch is a compact summary of a data aggregation. It captures all thenecessary information to either extract an aggregation result, continue adata aggregation, or merge it with another sketch, enabling re-aggregation.

Computing a metric using a sketch is substantially less expensive than computingan exact value. If your computation is too slow or requires too much temporarystorage, use sketches to reduce query time and resources.

Additionally, computingcardinalities, such as thenumber of distinct users, orquantiles, such asmedian visit duration, without sketches is usually only possible by running jobsover the raw data because already-aggregated data can't be combined anymore.

Consider a table with the following data:

Product	Number of users	Median visit duration
Product A	500 million	10 minutes
Product B	20 million	2 minutes

Computing the total number of users for both products isn't possible because wedon't know how many users used both products in the table.Likewise, computing the median visit duration isn't possible because thedistribution of the visit durations has been lost.

A solution is to store sketches in the table instead. Each sketch is anapproximate and compact representation of a particular input property, such ascardinality, that you can store, merge (or re-aggregate), and query fornear-exact results. In the previous example, you can estimate the number ofdistinct users for Product A and Product B by creating and merging(re-aggregating) the sketches for each product. Youcan also estimate the medianvisit duration with quantile sketches that you can likewise merge and query.

For example, the following query usesHLL++ andKLLsketches to estimate distinct users and median visit duration for YouTube(Product A) and Google Maps (Product B):

-- Build sketches for YouTube stats.CREATETABLEuser.YOUTUBE_ACCESS_STATSASSELECTHLL_COUNT.INIT(user_id)ASdistinct_users_sketch,KLL_QUANTILES.INIT_INT64(visit_duration_ms)ASvisit_duration_ms_sketch,hour_of_dayFROMYOUTUBE_ACCESS_LOG()GROUPBYhour_of_day;-- Build sketches for Maps stats.CREATETABLEuser.MAPS_ACCESS_STATSASSELECTHLL_COUNT.INIT(user_id)ASdistinct_users_sketch,KLL_QUANTILES.INIT_INT64(visit_duration_ms)ASvisit_duration_ms_sketch,hour_of_dayFROMMAPS_ACCESS_LOG()GROUPBYhour_of_day;-- Query YouTube hourly stats.SELECTHLL_COUNT.EXTRACT(distinct_users_sketch)ASdistinct_users,KLL_QUANTILES.EXTRACT_POINT_INT64(visit_duration_ms_sketch,0.5)ASmedian_visit_duration,hour_of_dayFROMuser.YOUTUBE_ACCESS_STATS;-- Query YouTube daily stats.SELECTHLL_COUNT.MERGE(distinct_users_sketch),KLL_QUANTILES.MERGE_POINT_INT64(visit_duration_ms_sketch,0.5)ASmedian_visit_duration,dateFROMuser.YOUTUBE_ACCESS_STATSGROUPBYdate;-- Query total stats across YouTube and Maps.SELECTHLL_COUNT.MERGE(distinct_users_sketch)ASunique_users_all_services,KLL_QUANTILES.MERGE_POINT_INT64(visit_duration_ms_sketch,0.5)ASmedian_visit_duration_all_services,FROM(SELECT*FROMuser.YOUTUBE_ACCESS_STATSUNIONALLSELECT*FROMuser.MAPS_ACCESS_STATS);

Because a sketch has lossy compression of the original data, it introduces astatistical error that's represented by an error bound or confidence interval(CI). For most applications, this uncertainty is small. For example, a typicalcardinality-counting sketch has a relative error of about 1% in 95% of allcases. A sketch trades some accuracy, orprecision, for faster and lessexpensive computations, and less storage.

In summary, a sketch has these main properties:

Represents an approximate aggregate for a specific metric
Is compact
Is a serialized form of an in-memory, sublinear data structure
Is typically a fixed size and asymptotically smaller than the input
Can introduce a statistical error that you determine with a precisionlevel
Can be merged with other sketches to summarize the union of the underlyingdata sets

Re-aggregation with sketch merging

Sketches let you store and merge data for efficient re-aggregation. This makessketches particularly useful for materialized views of data sets. You can mergesketches to construct a summary of multiple data streams based on partialsketches created for each stream.

For example, if you create a sketch for the estimated number of distinct usersevery day, you can get the number of distinct users during the last seven daysby merging daily sketches. Re-aggregating the merged daily sketches helps youavoid reading the full input of the data set.

Sketch re-aggregation is also useful in online analytical processing (OLAP). Youcan merge sketches to create a roll-up of anOLAP cube, where thesketch summarizes data along one or more specific dimensions of the cube. OLAProll-ups aren't possible with true distinct counts.

Which type of sketch should I use?

Different sketching algorithms are designed for different types of metrics, suchasHLL++ for distinct counts andKLL for quantiles. GoogleSQL also providesApproximate aggregate functions that you can use to querythese types of data without having to specify query details such as precisionlevel.

The sketch that you use depends on the type of data that you need to estimate.

Estimate cardinality

If you need to estimatecardinality, use anHLL++ sketch.

For example, to get the number of unique users who actively used a product in agiven month (MAU or 28DAU metrics), use an HLL++ sketch.

Compute a quantile

If you need to get aquantile of a data set, useaKLL sketch.

For example, to get the median visit duration of customers ina store, or to track the 95th percentile latency that tickets stay in a queuebefore being addressed, use a KLL sketch.

HLL++ sketches

HyperLogLog++ (HLL++) is a sketching algorithm for estimating cardinality. HLL++is based on the paperHyperLogLog in Practice, where the++ denotes the augmentations made to the HyperLogLog algorithm.

Cardinality is the number of distinct elements in theinput for a sketch. For example, you could use an HLL++ sketch to get the numberof unique users who have opened an application.

HLL++ estimates very small and very large cardinalities. HLL++ includes a64-bit hash function, sparse representation to reduce memory requirements forsmall cardinality estimates, and empirical bias correction forsmall cardinality estimates.

Precision

HLL++ sketches support custom precision. The following table shows the supportedprecision values, the maximum storage size, and the confidence interval (CI) oftypical precision levels:

Precision	Max storage size	65% CI	95% CI	99% CI
10	1 KiB + 28 B	±3.25%	±6.50%	±9.75%
11	2 KiB + 28 B	±2.30%	±4.60%	±6.89%
12	4 KiB + 28 B	±1.63%	±3.25%	±4.88%
13	8 KiB + 28 B	±1.15%	±2.30%	±3.45%
14	16 KiB + 30 B	±0.81%	±1.63%	±2.44%
15 (default)	32 KiB + 30 B	±0.57%	±1.15%	±1.72%
16	64 KiB + 30 B	±0.41%	±0.81%	±1.22%
17	128 KiB + 30 B	±0.29%	±0.57%	±0.86%
18	256 KiB + 30 B	±0.20%	±0.41%	±0.61%
19	512 KiB + 30 B	±0.14%	±0.29%	±0.43%
20	1024 KiB + 30 B	±0.10%	±0.20%	±0.30%
21	2048 KiB + 32 B	±0.07%	±0.14%	±0.22%
22	4096 KiB + 32 B	±0.05%	±0.10%	±0.15%
23	8192 KiB + 32 B	±0.04%	±0.07%	±0.11%
24	16384 KiB + 32 B	±0.03%	±0.05%	±0.08%

You can define precision for an HLL++ sketch when you initialize it with theHLL_COUNT.INIT function.

Deletion

You can't delete values from an HLL++ sketch.

Additional details

For a list of functions that you can use with HLL++ sketches, seeHLL++ functions.

Sketch integration

You can integrate HLL++ sketches with other systems. For example, you can buildsketches in external applications, likeDataflow,Apache Spark, andZetaSketch and then consume them in GoogleSQL or vice versa.

In addition to GoogleSQL, you can use HLL++ sketcheswithJava.

KLL sketches

KLL (short for Karnin-Lang-Liberty) is a streaming algorithm to compute sketchesfor approximatequantiles. It computes arbitrary quantiles far moreefficiently than exact computations at the price of a small approximation error.

Precision

KLL sketches support custom precision. Precision defines the exactness of areturned approximate quantileq.

By default, the rank of an approximate quantile can be at most±1/1000 * n offfrom⌈Φ * n⌉, wheren is the number of rows in the input and⌈Φ * n⌉ isthe rank of the exact quantile.

If you provide custom precision, the rank of the approximate quantile can beat most±1/precision * n off from the rank of the exact quantile. The erroris within this error bound in 99.999% of cases. This error guarantee onlyapplies to the difference between exact and approximate ranks. The numericaldifference between the exact and approximated value for a quantile can bearbitrarily large.

For example, suppose you want to find the median value,Φ = 0.5, and you usethe default precision of1000. Then the rank of the value returned by theKLL_QUANTILES.EXTRACT_POINT function differs from the true rank by at mostn/1000 in 99.999% of cases. In other words, the returned value is almostalways between the 49.9th and 50.1st percentiles. If you have 1,000,000 items inyour sketch, then the rank of the returned median is almost always between499,000 and 501,000.

If you use a custom precision of100 to find the median value, then the rankof the value returned by theKLL_QUANTILES.EXTRACT_POINT function differs fromthe true rank by at mostn/100 in 99.999% of cases. In other words, thereturned value is almost always between the 49th and 51st percentiles. If youhave 1,000,000 items in your sketch, then the rank of the returned median isalmost always between 490,000 and 510,000.

You can define precision for a KLL sketch when you initialize it with theKLL_QUANTILES.INIT function.

Size

KLL sketch size depends on the precision parameter and the input type.If your input type isINT64, the sketches can use additional optimizationthat's especially helpful if the input values come from a small universe. Thefollowing table contains two columns forINT64. One column provides anupper bound on sketch size for items from a limited universe of size 1B, and asecond column provides an upper bound for arbitrary input values.

Precision	FLOAT64	INT64 (<1B)	INT64 (Any)
10	761 B	360 B	717 B
20	1.46 KB	706 B	1.47 KB
50	3.49 KB	1.72 KB	3.60 KB
100	6.94 KB	3.44 KB	7.12 KB
200	13.87 KB	6.33 KB	13.98 KB
500	35.15 KB	14.47 KB	35.30 KB
1000	71.18 KB	27.86 KB	71.28 KB
2000	144.51 KB	55.25 KB	144.57 KB
5000	368.87 KB	139.54 KB	368.96 KB
10000	749.82 KB	282.27 KB	697.80 KB
20000	1.52 MB	573.16 KB	1.37 MB
50000	3.90 MB	1.12 MB	3.45 MB
100000	7.92 MB	2.18 MB	6.97 MB

Phi

Phi (Φ) represents the quantile to produce as a fraction of the total number ofrows in sketch input, normalized between 0 and 1. If a function supportsphi, the function returns a valuev such that approximately Φ *n inputsare less than or equal tov, and (1-Φ) *n inputs are greater than orequal tov.

Additional details

For a list of functions that you can use with KLL sketches, seeKLL quantile functions.

The KLL algorithm is defined in the paperOptimal Quantile Approximation in Streams, and is namedafter its authors, Karnin, Lang, and Liberty, who published the paper in 2016.The KLL algorithm improves the olderMP80 algorithm by using variable-size buffers to reducememory use for large data sets, reducing the sketch size fromO(log n) toO(1). Due to the non-deterministic nature of the algorithm, sketches createdon the same set of data with the same precision might not be identical.

Quantiles

Quantiles are cut points dividing the range of aprobability distribution into continuous intervals with equal probabilities, ordividing the observations in a sample in the same way. A quantile-supportingsketch lets you estimate quantiles by summarizing those intervals andprobabilities into near-exact quantile results.

Quantiles are typically defined in two ways:

For a positive integerq,q-quantiles are a set of values that partitionan input set intoq subsets of nearly equal size. Some of these havespecific names: the single 2-quantile is the median, the 4-quantiles arequartiles, the 100-quantiles are percentiles, etc. KLL functionsadditionally return the (exact) minimum and the maximum of the input, sowhen querying for the 2-quantiles, three values are returned.
Tip: To extract a set ofq-quantiles whereq is thenumber argument,use theMERGE andEXTRACT functions in theKLL_QUANTILES.* functions.
Alternatively, quantiles might be considered individualΦ-quantiles, whereΦ is a real number with0 <= Φ <= 1. TheΦ-quantilex is an elementof the input such that aΦ fraction of the input is less than or equal tox, and a(1-Φ) fraction is greater than or equal tox. In thisnotation, the median is the 0.5-quantile, and the 95th percentile is the0.95-quantile.
Tip: To extract individualΦ-quantiles, use the quantile-supportingMERGE_POINT andEXTRACT_POINT functions,whereΦ is thephi argument.

For example, you can use a quantiles-supporting sketch to get the median of thenumber of times an application is opened by users.

Approximate aggregate functions

As an alternative to specific sketch-based approximation functions,GoogleSQL provides predefined approximate aggregatefunctions. These approximate aggregate functions support sketches for commonestimations such as distinct count, quantiles, and top count, but they don'tallow custom precision. They also don't expose and store the sketch forre-aggregation like other types of sketches. The approximate aggregate functionsare designed for running quick sketch-based queries without detailedconfiguration.

For a list of approximate aggregate functions that you can use withsketch-based approximation, see Approximate aggregate functions.

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-05 UTC.

Movatterモバイル変換

Sketches Stay organized with collections Save and categorize content based on your preferences.