Commit 6bb6a62

Use extended stats for precise estimation of bucket size in hash join

Recognizing the real-life complexity where columns in a table often have
functional dependencies, PostgreSQL's estimate of the number of distinct
values over a set of columns can be too low (or, much more rarely, too high)
when dealing with a multi-clause JOIN. In the case of hash join, this can
end up with a small number of predicted hash buckets and, as a result,
picking a non-optimal merge join.

To improve the situation, we introduce one additional stage of bucket-size
estimation: given two or more join clauses, the estimator looks up extended
statistics and uses them for multicolumn estimation. Clauses are grouped into
lists, each containing expressions referencing the same relation. The result
of the multicolumn estimation made over such a list is combined with the
others according to the caller's logic. Clauses that are not estimated are
returned to the caller for further estimation.

Discussion: https://postgr.es/m/52257607-57f6-850d-399a-ec33a654457b%40postgrespro.ru
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>

1 parent fae535d · commit 6bb6a62

File tree

5 files changed: +264 −1 lines changed


‎src/backend/optimizer/path/costsize.c

Lines changed: 11 additions & 1 deletion
@@ -4339,9 +4339,19 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	}
 	else
 	{
+		List	   *otherclauses;
+
 		innerbucketsize = 1.0;
 		innermcvfreq = 1.0;
-		foreach(hcl, hashclauses)
+
+		/* At first, try to estimate bucket size using extended statistics. */
+		otherclauses = estimate_multivariate_bucketsize(root,
+														inner_path->parent,
+														hashclauses,
+														&innerbucketsize);
+
+		/* Pass through the remaining clauses */
+		foreach(hcl, otherclauses)
 		{
 			RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
 			Selectivity thisbucketsize;

‎src/backend/utils/adt/selfuncs.c

Lines changed: 175 additions & 0 deletions
@@ -3765,6 +3765,181 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * Try to estimate the bucket size of the hash join inner side when the join
+ * condition contains two or more clauses by employing extended statistics.
+ *
+ * The main idea of this approach is that the distinct value generated by
+ * multivariate estimation on two or more columns would provide less bucket size
+ * than estimation on one separate column.
+ *
+ * IMPORTANT: It is crucial to synchronize the approach of combining different
+ * estimations with the caller's method.
+ *
+ * Return a list of clauses that didn't fetch any extended statistics.
+ */
+List *
+estimate_multivariate_bucketsize(PlannerInfo *root, RelOptInfo *inner,
+								 List *hashclauses,
+								 Selectivity *innerbucketsize)
+{
+	List	   *clauses = list_copy(hashclauses);
+	List	   *otherclauses = NIL;
+	double		ndistinct = 1.0;
+
+	if (list_length(hashclauses) <= 1)
+
+		/*
+		 * Nothing to do for a single clause. Could we employ univariate
+		 * extended stat here?
+		 */
+		return hashclauses;
+
+	while (clauses != NIL)
+	{
+		ListCell   *lc;
+		int			relid = -1;
+		List	   *varinfos = NIL;
+		List	   *origin_rinfos = NIL;
+		double		mvndistinct;
+		List	   *origin_varinfos;
+		int			group_relid = -1;
+		RelOptInfo *group_rel = NULL;
+		ListCell   *lc1,
+				   *lc2;
+
+		/*
+		 * Find clauses, referencing the same single base relation and try to
+		 * estimate such a group with extended statistics. Create varinfo for
+		 * an approved clause, push it to otherclauses, if it can't be
+		 * estimated here or ignore to process at the next iteration.
+		 */
+		foreach(lc, clauses)
+		{
+			RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
+			Node	   *expr;
+			Relids		relids;
+			GroupVarInfo *varinfo;
+
+			/*
+			 * Find the inner side of the join, which we need to estimate the
+			 * number of buckets. Use outer_is_left because the
+			 * clause_sides_match_join routine has called on hash clauses.
+			 */
+			relids = rinfo->outer_is_left ?
+				rinfo->right_relids : rinfo->left_relids;
+			expr = rinfo->outer_is_left ?
+				get_rightop(rinfo->clause) : get_leftop(rinfo->clause);
+
+			if (bms_get_singleton_member(relids, &relid) &&
+				root->simple_rel_array[relid]->statlist != NIL)
+			{
+				/*
+				 * This inner-side expression references only one relation.
+				 * Extended statistics on this clause can exist.
+				 */
+				if (group_relid < 0)
+				{
+					RangeTblEntry *rte = root->simple_rte_array[relid];
+
+					if (!rte || (rte->relkind != RELKIND_RELATION &&
+								 rte->relkind != RELKIND_MATVIEW &&
+								 rte->relkind != RELKIND_FOREIGN_TABLE &&
+								 rte->relkind != RELKIND_PARTITIONED_TABLE))
+					{
+						/* Extended statistics can't exist in principle */
+						otherclauses = lappend(otherclauses, rinfo);
+						clauses = foreach_delete_current(clauses, lc);
+						continue;
+					}
+
+					group_relid = relid;
+					group_rel = root->simple_rel_array[relid];
+				}
+				else if (group_relid != relid)
+
+					/*
+					 * Being in the group forming state we don't need other
+					 * clauses.
+					 */
+					continue;
+
+				varinfo = (GroupVarInfo *) palloc(sizeof(GroupVarInfo));
+				varinfo->var = expr;
+				varinfo->rel = root->simple_rel_array[relid];
+				varinfo->ndistinct = 0.0;
+				varinfo->isdefault = false;
+				varinfos = lappend(varinfos, varinfo);
+
+				/*
+				 * Remember the link to RestrictInfo for the case the clause
+				 * is failed to be estimated.
+				 */
+				origin_rinfos = lappend(origin_rinfos, rinfo);
+			}
+			else
+				/* This clause can't be estimated with extended statistics */
+				otherclauses = lappend(otherclauses, rinfo);
+
+			clauses = foreach_delete_current(clauses, lc);
+		}
+
+		if (list_length(varinfos) < 2)
+		{
+			/*
+			 * Multivariate statistics doesn't apply to single columns except
+			 * for expressions, but it has not been implemented yet.
+			 */
+			otherclauses = list_concat(otherclauses, origin_rinfos);
+			list_free_deep(varinfos);
+			list_free(origin_rinfos);
+			continue;
+		}
+
+		Assert(group_rel != NULL);
+
+		/* Employ the extended statistics. */
+		origin_varinfos = varinfos;
+		for (;;)
+		{
+			bool		estimated = estimate_multivariate_ndistinct(root,
+																	group_rel,
+																	&varinfos,
+																	&mvndistinct);
+
+			if (!estimated)
+				break;
+
+			/*
+			 * We've got an estimation. Use ndistinct value in a consistent
+			 * way - according to the caller's logic (see
+			 * final_cost_hashjoin).
+			 */
+			if (ndistinct < mvndistinct)
+				ndistinct = mvndistinct;
+			Assert(ndistinct >= 1.0);
+		}
+
+		Assert(list_length(origin_varinfos) == list_length(origin_rinfos));
+
+		/* Collect unmatched clauses as otherclauses. */
+		forboth(lc1, origin_varinfos, lc2, origin_rinfos)
+		{
+			GroupVarInfo *vinfo = lfirst(lc1);
+
+			if (!list_member_ptr(varinfos, vinfo))
+				/* Already estimated */
+				continue;
+
+			/* Can't be estimated here - push to the returning list */
+			otherclauses = lappend(otherclauses, lfirst(lc2));
+		}
+	}
+
+	*innerbucketsize = 1.0 / ndistinct;
+	return otherclauses;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.

‎src/include/utils/selfuncs.h

Lines changed: 4 additions & 0 deletions
@@ -218,6 +218,10 @@ extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 									  double input_rows, List **pgset,
 									  EstimationInfo *estinfo);
 
+extern List *estimate_multivariate_bucketsize(PlannerInfo *root,
+											  RelOptInfo *inner,
+											  List *hashclauses,
+											  Selectivity *innerbucketsize);
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 									   Node *hashkey, double nbuckets,
 									   Selectivity *mcv_freq,

‎src/test/regress/expected/stats_ext.out

Lines changed: 45 additions & 0 deletions
@@ -3383,3 +3383,48 @@ SELECT * FROM check_estimated_rows('
 (1 row)
 
 DROP TABLE grouping_unique;
+--
+-- Extended statistics on sb_2 (x, y, z) improve a bucket size estimation,
+-- and the optimizer may choose hash join.
+--
+CREATE TABLE sb_1 AS
+  SELECT gs % 10 AS x, gs % 10 AS y, gs % 10 AS z
+  FROM generate_series(1, 1e4) AS gs;
+CREATE TABLE sb_2 AS
+  SELECT gs % 49 AS x, gs % 51 AS y, gs % 73 AS z, 'abc' || gs AS payload
+  FROM generate_series(1, 1e4) AS gs;
+ANALYZE sb_1, sb_2;
+-- During hash join estimation, the number of distinct values on each column
+-- is calculated. The optimizer selects the smallest number of distinct values
+-- and the largest hash bucket size. The optimizer decides that the hash
+-- bucket size is quite big because there are possibly many correlations.
+EXPLAIN (COSTS OFF) -- Choose merge join
+SELECT * FROM sb_1 a, sb_2 b WHERE a.x = b.x AND a.y = b.y AND a.z = b.z;
+                         QUERY PLAN
+-------------------------------------------------------------
+ Merge Join
+   Merge Cond: ((a.z = b.z) AND (a.x = b.x) AND (a.y = b.y))
+   ->  Sort
+         Sort Key: a.z, a.x, a.y
+         ->  Seq Scan on sb_1 a
+   ->  Sort
+         Sort Key: b.z, b.x, b.y
+         ->  Seq Scan on sb_2 b
+(8 rows)
+
+-- The ndistinct extended statistics on (x, y, z) provides more reliable value
+-- of bucket size.
+CREATE STATISTICS extstat_sb_2 (ndistinct) ON x, y, z FROM sb_2;
+ANALYZE sb_2;
+EXPLAIN (COSTS OFF) -- Choose hash join
+SELECT * FROM sb_1 a, sb_2 b WHERE a.x = b.x AND a.y = b.y AND a.z = b.z;
+                         QUERY PLAN
+------------------------------------------------------------
+ Hash Join
+   Hash Cond: ((a.x = b.x) AND (a.y = b.y) AND (a.z = b.z))
+   ->  Seq Scan on sb_1 a
+   ->  Hash
+         ->  Seq Scan on sb_2 b
+(5 rows)
+
+DROP TABLE sb_1, sb_2 CASCADE;

‎src/test/regress/sql/stats_ext.sql

Lines changed: 29 additions & 0 deletions
@@ -1719,3 +1719,32 @@ SELECT * FROM check_estimated_rows('
     ON t1.t1 = q1.x;
 ');
 DROP TABLE grouping_unique;
+
+--
+-- Extended statistics on sb_2 (x, y, z) improve a bucket size estimation,
+-- and the optimizer may choose hash join.
+--
+CREATE TABLE sb_1 AS
+  SELECT gs % 10 AS x, gs % 10 AS y, gs % 10 AS z
+  FROM generate_series(1, 1e4) AS gs;
+CREATE TABLE sb_2 AS
+  SELECT gs % 49 AS x, gs % 51 AS y, gs % 73 AS z, 'abc' || gs AS payload
+  FROM generate_series(1, 1e4) AS gs;
+ANALYZE sb_1, sb_2;
+
+-- During hash join estimation, the number of distinct values on each column
+-- is calculated. The optimizer selects the smallest number of distinct values
+-- and the largest hash bucket size. The optimizer decides that the hash
+-- bucket size is quite big because there are possibly many correlations.
+EXPLAIN (COSTS OFF) -- Choose merge join
+SELECT * FROM sb_1 a, sb_2 b WHERE a.x = b.x AND a.y = b.y AND a.z = b.z;
+
+-- The ndistinct extended statistics on (x, y, z) provides more reliable value
+-- of bucket size.
+CREATE STATISTICS extstat_sb_2 (ndistinct) ON x, y, z FROM sb_2;
+ANALYZE sb_2;
+
+EXPLAIN (COSTS OFF) -- Choose hash join
+SELECT * FROM sb_1 a, sb_2 b WHERE a.x = b.x AND a.y = b.y AND a.z = b.z;
+
+DROP TABLE sb_1, sb_2 CASCADE;
