Commit cb5c149
Fix misestimation of n_distinct for a nearly-unique column with many nulls.
If ANALYZE found no repeated non-null entries in its sample, it set the
column's stadistinct value to -1.0, intending to indicate that the entries
are all distinct.  But what this value actually means is that the number
of distinct values is 100% of the table's rowcount, and thus it was
overestimating the number of distinct values by however many nulls there
are.  This could lead to very poor selectivity estimates, as for example
in a recent report from Andreas Joseph Krogh.  We should discount the
stadistinct value by whatever we've estimated the nulls fraction to be.
(That is what will happen if we choose to use a negative stadistinct for
a column that does have repeated entries, so this code path was just
inconsistent.)

In addition to fixing the stadistinct entries stored by several different
ANALYZE code paths, adjust the logic where get_variable_numdistinct()
forces an "all distinct" estimate on the basis of finding a relevant unique
index.  Unique indexes don't reject nulls, so there's no reason to assume
that the null fraction doesn't apply.

Back-patch to all supported branches.  Back-patching is a bit of a judgment
call, but this problem seems to affect only a few users (else we'd have
identified it long ago), and it's bad enough when it does happen that
destabilizing plan choices in a worse direction seems unlikely.

Patch by me, with documentation wording suggested by Dean Rasheed

Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
Discussion: <16143.1470350371@sss.pgh.pa.us>
1 parent 71dca40 · commit cb5c149

6 files changed: +34 −21 lines changed

doc/src/sgml/catalogs.sgml

Lines changed: 3 additions & 3 deletions
@@ -5890,9 +5890,9 @@
       <entry>The number of distinct nonnull data values in the column.
       A value greater than zero is the actual number of distinct values.
       A value less than zero is the negative of a multiplier for the number
-      of rows in the table; for example, a column in which values appear about
-      twice on the average could be represented by
-      <structfield>stadistinct</> = -0.5.
+      of rows in the table; for example, a column in which about 80% of the
+      values are nonnull and each nonnull value appears about twice on
+      average could be represented by <structfield>stadistinct</> = -0.4.
       A zero value means the number of distinct values is unknown.
       </entry>
      </row>

src/backend/commands/analyze.c

Lines changed: 11 additions & 5 deletions
@@ -2043,8 +2043,11 @@ compute_distinct_stats(VacAttrStatsP stats,

 		if (nmultiple == 0)
 		{
-			/* If we found no repeated values, assume it's a unique column */
-			stats->stadistinct = -1.0;
+			/*
+			 * If we found no repeated non-null values, assume it's a unique
+			 * column; but be sure to discount for any nulls we found.
+			 */
+			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 		}
 		else if (track_cnt < track_max && toowide_cnt == 0 &&
 				 nmultiple == track_cnt)
@@ -2398,8 +2401,11 @@ compute_scalar_stats(VacAttrStatsP stats,

 		if (nmultiple == 0)
 		{
-			/* If we found no repeated values, assume it's a unique column */
-			stats->stadistinct = -1.0;
+			/*
+			 * If we found no repeated non-null values, assume it's a unique
+			 * column; but be sure to discount for any nulls we found.
+			 */
+			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 		}
 		else if (toowide_cnt == 0 && nmultiple == ndistinct)
 		{
@@ -2703,7 +2709,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 		else
 			stats->stawidth = stats->attrtype->typlen;
 		/* Assume all too-wide values are distinct, so it's a unique column */
-		stats->stadistinct = -1.0;
+		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 	}
 	else if (null_cnt > 0)
 	{

src/backend/tsearch/ts_typanalyze.c

Lines changed: 1 addition & 1 deletion
@@ -295,7 +295,7 @@ compute_tsvector_stats(VacAttrStats *stats,
 		stats->stawidth = total_width / (double) nonnull_cnt;

 		/* Assume it's a unique column (see notes above) */
-		stats->stadistinct = -1.0;
+		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);

 		/*
 		 * Construct an array of the interesting hashtable items, that is,

src/backend/utils/adt/rangetypes_typanalyze.c

Lines changed: 3 additions & 1 deletion
@@ -203,7 +203,9 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 		/* Do the simple null-frac and width stats */
 		stats->stanullfrac = (double) null_cnt / (double) samplerows;
 		stats->stawidth = total_width / (double) non_null_cnt;
-		stats->stadistinct = -1.0;
+
+		/* Estimate that non-null values are unique */
+		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);

 		/* Must copy the target values into anl_context */
 		old_cxt = MemoryContextSwitchTo(stats->anl_context);

src/backend/utils/adt/selfuncs.c

Lines changed: 8 additions & 4 deletions
@@ -4684,14 +4684,16 @@ double
 get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 {
 	double		stadistinct;
+	double		stanullfrac = 0.0;
 	double		ntuples;

 	*isdefault = false;

 	/*
 	 * Determine the stadistinct value to use.  There are cases where we can
 	 * get an estimate even without a pg_statistic entry, or can get a better
-	 * value than is in pg_statistic.
+	 * value than is in pg_statistic.  Grab stanullfrac too if we can find it
+	 * (otherwise, assume no nulls, for lack of any better idea).
 	 */
 	if (HeapTupleIsValid(vardata->statsTuple))
 	{
@@ -4700,6 +4702,7 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)

 		stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
 		stadistinct = stats->stadistinct;
+		stanullfrac = stats->stanullfrac;
 	}
 	else if (vardata->vartype == BOOLOID)
 	{
@@ -4723,7 +4726,7 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 		{
 			case ObjectIdAttributeNumber:
 			case SelfItemPointerAttributeNumber:
-				stadistinct = -1.0;		/* unique */
+				stadistinct = -1.0;		/* unique (and all non null) */
 				break;
 			case TableOidAttributeNumber:
 				stadistinct = 1.0;		/* only 1 value */
@@ -4745,10 +4748,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 	 * If there is a unique index or DISTINCT clause for the variable, assume
 	 * it is unique no matter what pg_statistic says; the statistics could be
 	 * out of date, or we might have found a partial unique index that proves
-	 * the var is unique for this query.
+	 * the var is unique for this query.  However, we'd better still believe
+	 * the null-fraction statistic.
 	 */
 	if (vardata->isunique)
-		stadistinct = -1.0;
+		stadistinct = -1.0 * (1.0 - stanullfrac);

 	/*
 	 * If we had an absolute estimate, use that.

src/include/catalog/pg_statistic.h

Lines changed: 8 additions & 7 deletions
@@ -57,13 +57,14 @@ CATALOG(pg_statistic,2619) BKI_WITHOUT_OIDS
 	 *		> 0		actual number of distinct values
 	 *		< 0		negative of multiplier for number of rows
 	 * The special negative case allows us to cope with columns that are
-	 * unique (stadistinct = -1) or nearly so (for example, a column in
-	 * which values appear about twice on the average could be represented
-	 * by stadistinct = -0.5).  Because the number-of-rows statistic in
-	 * pg_class may be updated more frequently than pg_statistic is, it's
-	 * important to be able to describe such situations as a multiple of
-	 * the number of rows, rather than a fixed number of distinct values.
-	 * But in other cases a fixed number is correct (eg, a boolean column).
+	 * unique (stadistinct = -1) or nearly so (for example, a column in which
+	 * non-null values appear about twice on the average could be represented
+	 * by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
+	 * column is nulls).  Because the number-of-rows statistic in pg_class may
+	 * be updated more frequently than pg_statistic is, it's important to be
+	 * able to describe such situations as a multiple of the number of rows,
+	 * rather than a fixed number of distinct values.  But in other cases a
+	 * fixed number is correct (eg, a boolean column).
 	 * ----------------
 	 */
 	float4		stadistinct;
