Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commit84f9a35

Browse files
committed
Improve estimate of distinct values in estimate_num_groups().
When adjusting the estimate for the number of distinct values from arel in a grouped query to take into account the selectivity of therel's restrictions, use a formula that is less likely to produceunder-estimates.The old formula simply multiplied the number of distinct values in therel by the restriction selectivity, which would be correct if therestrictions were fully correlated with the grouping expressions, butcan produce significant under-estimates in cases where they are notwell correlated.The new formula is based on the random selection probability, and soassumes that the restrictions are not correlated with the groupingexpressions. This is guaranteed to produce larger estimates, and ofcourse risks over-estimating in cases where the restrictions arecorrelated, but that has less severe consequences thanunder-estimating, which might lead to a HashAgg that consumes anexcessive amount of memory.This could possibly be improved upon in the future by identifyingcorrelated restrictions and using a hybrid of the old and newformulae.Author: Tomas Vondra, with some hacking be meReviewed-by: Mark Dilger, Alexander Korotkov, Dean Rasheed and Tom LaneDiscussion:http://www.postgresql.org/message-id/flat/56CD0381.5060502@2ndquadrant.com
1 parentbf08f22 commit84f9a35

File tree

2 files changed

+64
-25
lines changed

2 files changed

+64
-25
lines changed

‎src/backend/utils/adt/selfuncs.c

Lines changed: 53 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -3237,15 +3237,15 @@ add_unique_group_var(PlannerInfo *root, List *varinfos,
32373237
*restriction selectivity of the equality in the next step.
32383238
*4. For Vars within a single source rel, we multiply together the numbers
32393239
*of values, clamp to the number of rows in the rel (divided by 10 if
3240-
*more than one Var), and then multiply bythe selectivity of the
3241-
*restriction clauses for that rel. When there's more than one Var,
3242-
*the initial product is probably too high (it's the worst case) but
3243-
*clamping to a fraction of the rel's rows seems to be a helpful
3244-
*heuristic for not letting the estimate get out of hand. (The factor
3245-
*of 10 is derived from pre-Postgres-7.4 practice.)Multiplying
3246-
*by the restriction selectivityis effectively assumingthat the
3247-
*restriction clauses are independent of the grouping, whichis a crummy
3248-
*assumption, but it's hard to do better.
3240+
*more than one Var), and then multiply bya factor based on the
3241+
*selectivity of therestriction clauses for that rel. When there's
3242+
*more than one Var,the initial product is probably too high (it's the
3243+
*worst case) butclamping to a fraction of the rel's rows seems to be a
3244+
*helpfulheuristic for not letting the estimate get out of hand. (The
3245+
*factorof 10 is derived from pre-Postgres-7.4 practice.)The factor
3246+
*we multiplybyto adjust forthe restriction selectivityassumesthat
3247+
*therestriction clauses are independent of the grouping, whichmay not
3248+
*be a validassumption, but it's hard to do better.
32493249
*5. If there are Vars from multiple rels, we repeat step 4 for each such
32503250
*rel, and multiply the results together.
32513251
* Note that rels not containing grouped Vars are ignored completely, as are
@@ -3439,9 +3439,51 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
34393439
reldistinct=clamp;
34403440

34413441
/*
3442-
* Multiply by restriction selectivity.
3442+
* Update the estimate based on the restriction selectivity,
3443+
* guarding against division by zero when reldistinct is zero.
3444+
* Also skip this if we know that we are returning all rows.
34433445
*/
3444-
reldistinct *=rel->rows /rel->tuples;
3446+
if (reldistinct>0&&rel->rows<rel->tuples)
3447+
{
3448+
/*
3449+
* Given a table containing N rows with n distinct values in a
3450+
* uniform distribution, if we select p rows at random then
3451+
* the expected number of distinct values selected is
3452+
*
3453+
* n * (1 - product((N-N/n-i)/(N-i), i=0..p-1))
3454+
*
3455+
* = n * (1 - (N-N/n)! / (N-N/n-p)! * (N-p)! / N!)
3456+
*
3457+
* See "Approximating block accesses in database
3458+
* organizations", S. B. Yao, Communications of the ACM,
3459+
* Volume 20 Issue 4, April 1977 Pages 260-261.
3460+
*
3461+
* Alternatively, re-arranging the terms from the factorials,
3462+
* this may be written as
3463+
*
3464+
* n * (1 - product((N-p-i)/(N-i), i=0..N/n-1))
3465+
*
3466+
* This form of the formula is more efficient to compute in
3467+
* the common case where p is larger than N/n. Additionally,
3468+
* as pointed out by Dell'Era, if i << N for all terms in the
3469+
* product, it can be approximated by
3470+
*
3471+
* n * (1 - ((N-p)/N)^(N/n))
3472+
*
3473+
* See "Expected distinct values when selecting from a bag
3474+
* without replacement", Alberto Dell'Era,
3475+
* http://www.adellera.it/investigations/distinct_balls/.
3476+
*
3477+
* The condition i << N is equivalent to n >> 1, so this is a
3478+
* good approximation when the number of distinct values in
3479+
* the table is large. It turns out that this formula also
3480+
* works well even when n is small.
3481+
*/
3482+
reldistinct *=
3483+
(1-pow((rel->tuples-rel->rows) /rel->tuples,
3484+
rel->tuples /reldistinct));
3485+
}
3486+
reldistinct=clamp_row_est(reldistinct);
34453487

34463488
/*
34473489
* Update estimate of total distinct groups.

‎src/test/regress/expected/subselect.out

Lines changed: 11 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -807,27 +807,24 @@ select * from int4_tbl where
807807
explain (verbose, costs off)
808808
select * from int4_tbl o where (f1, f1) in
809809
(select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
810-
QUERY PLAN
811-
----------------------------------------------------------------------
812-
Hash Join
810+
QUERY PLAN
811+
----------------------------------------------------------------
812+
HashSemiJoin
813813
Output: o.f1
814814
Hash Cond: (o.f1 = "ANY_subquery".f1)
815815
-> Seq Scan on public.int4_tbl o
816816
Output: o.f1
817817
-> Hash
818818
Output: "ANY_subquery".f1, "ANY_subquery".g
819-
->HashAggregate
819+
->Subquery Scan on "ANY_subquery"
820820
Output: "ANY_subquery".f1, "ANY_subquery".g
821-
Group Key: "ANY_subquery".f1, "ANY_subquery".g
822-
-> Subquery Scan on "ANY_subquery"
823-
Output: "ANY_subquery".f1, "ANY_subquery".g
824-
Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
825-
-> HashAggregate
826-
Output: i.f1, (generate_series(1, 2) / 10)
827-
Group Key: i.f1
828-
-> Seq Scan on public.int4_tbl i
829-
Output: i.f1
830-
(18 rows)
821+
Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
822+
-> HashAggregate
823+
Output: i.f1, (generate_series(1, 2) / 10)
824+
Group Key: i.f1
825+
-> Seq Scan on public.int4_tbl i
826+
Output: i.f1
827+
(15 rows)
831828

832829
select * from int4_tbl o where (f1, f1) in
833830
(select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp