Commit 6bb6a62

Use extended stats for precise estimation of bucket size in hash join

Recognizing the real-life complexity where columns in a table often have
functional dependencies, PostgreSQL's estimate of the number of distinct
values over a set of columns can be too low (or, much more rarely, too high)
when dealing with a multi-clause JOIN. In the case of hash join, this can
end up with a small number of predicted hash buckets and, as a result,
picking a non-optimal merge join.

To improve the situation, we introduce one additional stage of bucket-size
estimation: given two or more join clauses, the estimator looks up extended
statistics and uses them for multicolumn estimation. Clauses are grouped into
lists, each containing expressions referencing the same relation. The result
of the multicolumn estimation made over such a list is combined with the
others according to the caller's logic. Clauses that are not estimated are
returned to the caller for further estimation.

Discussion: https://postgr.es/m/52257607-57f6-850d-399a-ec33a654457b%40postgrespro.ru
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>

1 parent fae535d · commit 6bb6a62

File tree

5 files changed: +264 −1 lines changed


‎src/backend/optimizer/path/costsize.c

Lines changed: 11 additions & 1 deletion
@@ -4339,9 +4339,19 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	}
 	else
 	{
+		List	   *otherclauses;
+
 		innerbucketsize = 1.0;
 		innermcvfreq = 1.0;
-		foreach(hcl, hashclauses)
+
+		/* At first, try to estimate bucket size using extended statistics. */
+		otherclauses = estimate_multivariate_bucketsize(root,
+														inner_path->parent,
+														hashclauses,
+														&innerbucketsize);
+
+		/* Pass through the remaining clauses */
+		foreach(hcl, otherclauses)
 		{
 			RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
 			Selectivity thisbucketsize;

‎src/backend/utils/adt/selfuncs.c

Lines changed: 175 additions & 0 deletions
@@ -3765,6 +3765,181 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * Try to estimate the bucket size of the hash join inner side when the join
+ * condition contains two or more clauses by employing extended statistics.
+ *
+ * The main idea of this approach is that the distinct value generated by
+ * multivariate estimation on two or more columns would provide less bucket size
+ * than estimation on one separate column.
+ *
+ * IMPORTANT: It is crucial to synchronize the approach of combining different
+ * estimations with the caller's method.
+ *
+ * Return a list of clauses that didn't fetch any extended statistics.
+ */
+List *
+estimate_multivariate_bucketsize(PlannerInfo *root, RelOptInfo *inner,
+								 List *hashclauses,
+								 Selectivity *innerbucketsize)
+{
+	List	   *clauses = list_copy(hashclauses);
+	List	   *otherclauses = NIL;
+	double		ndistinct = 1.0;
+
+	if (list_length(hashclauses) <= 1)
+
+		/*
+		 * Nothing to do for a single clause. Could we employ univariate
+		 * extended stat here?
+		 */
+		return hashclauses;
+
+	while (clauses != NIL)
+	{
+		ListCell   *lc;
+		int			relid = -1;
+		List	   *varinfos = NIL;
+		List	   *origin_rinfos = NIL;
+		double		mvndistinct;
+		List	   *origin_varinfos;
+		int			group_relid = -1;
+		RelOptInfo *group_rel = NULL;
+		ListCell   *lc1,
+				   *lc2;
+
+		/*
+		 * Find clauses, referencing the same single base relation and try to
+		 * estimate such a group with extended statistics. Create varinfo for
+		 * an approved clause, push it to otherclauses, if it can't be
+		 * estimated here or ignore to process at the next iteration.
+		 */
+		foreach(lc, clauses)
+		{
+			RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
+			Node	   *expr;
+			Relids		relids;
+			GroupVarInfo *varinfo;
+
+			/*
+			 * Find the inner side of the join, which we need to estimate the
+			 * number of buckets. Use outer_is_left because the
+			 * clause_sides_match_join routine has called on hash clauses.
+			 */
+			relids = rinfo->outer_is_left ?
+				rinfo->right_relids : rinfo->left_relids;
+			expr = rinfo->outer_is_left ?
+				get_rightop(rinfo->clause) : get_leftop(rinfo->clause);
+
+			if (bms_get_singleton_member(relids, &relid) &&
+				root->simple_rel_array[relid]->statlist != NIL)
+			{
+				/*
+				 * This inner-side expression references only one relation.
+				 * Extended statistics on this clause can exist.
+				 */
+				if (group_relid < 0)
+				{
+					RangeTblEntry *rte = root->simple_rte_array[relid];
+
+					if (!rte || (rte->relkind != RELKIND_RELATION &&
+								 rte->relkind != RELKIND_MATVIEW &&
+								 rte->relkind != RELKIND_FOREIGN_TABLE &&
+								 rte->relkind != RELKIND_PARTITIONED_TABLE))
+					{
+						/* Extended statistics can't exist in principle */
+						otherclauses = lappend(otherclauses, rinfo);
+						clauses = foreach_delete_current(clauses, lc);
+						continue;
+					}
+
+					group_relid = relid;
+					group_rel = root->simple_rel_array[relid];
+				}
+				else if (group_relid != relid)
+
+					/*
+					 * Being in the group forming state we don't need other
+					 * clauses.
+					 */
+					continue;
+
+				varinfo = (GroupVarInfo *) palloc(sizeof(GroupVarInfo));
+				varinfo->var = expr;
+				varinfo->rel = root->simple_rel_array[relid];
+				varinfo->ndistinct = 0.0;
+				varinfo->isdefault = false;
+				varinfos = lappend(varinfos, varinfo);
+
+				/*
+				 * Remember the link to RestrictInfo for the case the clause
+				 * is failed to be estimated.
+				 */
+				origin_rinfos = lappend(origin_rinfos, rinfo);
+			}
+			else
+				/* This clause can't be estimated with extended statistics */
+				otherclauses = lappend(otherclauses, rinfo);
+
+			clauses = foreach_delete_current(clauses, lc);
+		}
+
+		if (list_length(varinfos) < 2)
+		{
+			/*
+			 * Multivariate statistics doesn't apply to single columns except
+			 * for expressions, but it has not been implemented yet.
+			 */
+			otherclauses = list_concat(otherclauses, origin_rinfos);
+			list_free_deep(varinfos);
+			list_free(origin_rinfos);
+			continue;
+		}
+
+		Assert(group_rel != NULL);
+
+		/* Employ the extended statistics. */
+		origin_varinfos = varinfos;
+		for (;;)
+		{
+			bool		estimated = estimate_multivariate_ndistinct(root,
+																	group_rel,
+																	&varinfos,
+																	&mvndistinct);
+
+			if (!estimated)
+				break;
+
+			/*
+			 * We've got an estimation. Use ndistinct value in a consistent
+			 * way - according to the caller's logic (see
+			 * final_cost_hashjoin).
+			 */
+			if (ndistinct < mvndistinct)
+				ndistinct = mvndistinct;
+			Assert(ndistinct >= 1.0);
+		}
+
+		Assert(list_length(origin_varinfos) == list_length(origin_rinfos));
+
+		/* Collect unmatched clauses as otherclauses. */
+		forboth(lc1, origin_varinfos, lc2, origin_rinfos)
+		{
+			GroupVarInfo *vinfo = lfirst(lc1);
+
+			if (!list_member_ptr(varinfos, vinfo))
+				/* Already estimated */
+				continue;
+
+			/* Can't be estimated here - push to the returning list */
+			otherclauses = lappend(otherclauses, lfirst(lc2));
+		}
+	}
+
+	*innerbucketsize = 1.0 / ndistinct;
+	return otherclauses;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.

‎src/include/utils/selfuncs.h

Lines changed: 4 additions & 0 deletions
@@ -218,6 +218,10 @@ extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 									  double input_rows, List **pgset,
 									  EstimationInfo *estinfo);
 
+extern List *estimate_multivariate_bucketsize(PlannerInfo *root,
+											  RelOptInfo *inner,
+											  List *hashclauses,
+											  Selectivity *innerbucketsize);
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 									   Node *hashkey, double nbuckets,
 									   Selectivity *mcv_freq,

‎src/test/regress/expected/stats_ext.out

Lines changed: 45 additions & 0 deletions
@@ -3383,3 +3383,48 @@ SELECT * FROM check_estimated_rows('
 (1 row)
 
 DROP TABLE grouping_unique;
+--
+-- Extended statistics on sb_2 (x, y, z) improve a bucket size estimation,
+-- and the optimizer may choose hash join.
+--
+CREATE TABLE sb_1 AS
+  SELECT gs % 10 AS x, gs % 10 AS y, gs % 10 AS z
+  FROM generate_series(1, 1e4) AS gs;
+CREATE TABLE sb_2 AS
+  SELECT gs % 49 AS x, gs % 51 AS y, gs % 73 AS z, 'abc' || gs AS payload
+  FROM generate_series(1, 1e4) AS gs;
+ANALYZE sb_1, sb_2;
+-- During hash join estimation, the number of distinct values on each column
+-- is calculated. The optimizer selects the smallest number of distinct values
+-- and the largest hash bucket size. The optimizer decides that the hash
+-- bucket size is quite big because there are possibly many correlations.
+EXPLAIN (COSTS OFF) -- Choose merge join
+SELECT * FROM sb_1 a, sb_2 b WHERE a.x = b.x AND a.y = b.y AND a.z = b.z;
+                         QUERY PLAN
+-------------------------------------------------------------
+ Merge Join
+   Merge Cond: ((a.z = b.z) AND (a.x = b.x) AND (a.y = b.y))
+   ->  Sort
+         Sort Key: a.z, a.x, a.y
+         ->  Seq Scan on sb_1 a
+   ->  Sort
+         Sort Key: b.z, b.x, b.y
+         ->  Seq Scan on sb_2 b
+(8 rows)
+
+-- The ndistinct extended statistics on (x, y, z) provides more reliable value
+-- of bucket size.
+CREATE STATISTICS extstat_sb_2 (ndistinct) ON x, y, z FROM sb_2;
+ANALYZE sb_2;
+EXPLAIN (COSTS OFF) -- Choose hash join
+SELECT * FROM sb_1 a, sb_2 b WHERE a.x = b.x AND a.y = b.y AND a.z = b.z;
+                         QUERY PLAN
+------------------------------------------------------------
+ Hash Join
+   Hash Cond: ((a.x = b.x) AND (a.y = b.y) AND (a.z = b.z))
+   ->  Seq Scan on sb_1 a
+   ->  Hash
+         ->  Seq Scan on sb_2 b
+(5 rows)
+
+DROP TABLE sb_1, sb_2 CASCADE;

‎src/test/regress/sql/stats_ext.sql

Lines changed: 29 additions & 0 deletions
@@ -1719,3 +1719,32 @@ SELECT * FROM check_estimated_rows('
     ON t1.t1 = q1.x;
 ');
 DROP TABLE grouping_unique;
+
+--
+-- Extended statistics on sb_2 (x, y, z) improve a bucket size estimation,
+-- and the optimizer may choose hash join.
+--
+CREATE TABLE sb_1 AS
+  SELECT gs % 10 AS x, gs % 10 AS y, gs % 10 AS z
+  FROM generate_series(1, 1e4) AS gs;
+CREATE TABLE sb_2 AS
+  SELECT gs % 49 AS x, gs % 51 AS y, gs % 73 AS z, 'abc' || gs AS payload
+  FROM generate_series(1, 1e4) AS gs;
+ANALYZE sb_1, sb_2;
+
+-- During hash join estimation, the number of distinct values on each column
+-- is calculated. The optimizer selects the smallest number of distinct values
+-- and the largest hash bucket size. The optimizer decides that the hash
+-- bucket size is quite big because there are possibly many correlations.
+EXPLAIN (COSTS OFF) -- Choose merge join
+SELECT * FROM sb_1 a, sb_2 b WHERE a.x = b.x AND a.y = b.y AND a.z = b.z;
+
+-- The ndistinct extended statistics on (x, y, z) provides more reliable value
+-- of bucket size.
+CREATE STATISTICS extstat_sb_2 (ndistinct) ON x, y, z FROM sb_2;
+ANALYZE sb_2;
+
+EXPLAIN (COSTS OFF) -- Choose hash join
+SELECT * FROM sb_1 a, sb_2 b WHERE a.x = b.x AND a.y = b.y AND a.z = b.z;
+
+DROP TABLE sb_1, sb_2 CASCADE;
