Commit cb5c149
Fix misestimation of n_distinct for a nearly-unique column with many nulls.
If ANALYZE found no repeated non-null entries in its sample, it set the
column's stadistinct value to -1.0, intending to indicate that the entries
are all distinct.  But what this value actually means is that the number
of distinct values is 100% of the table's rowcount, and thus it was
overestimating the number of distinct values by however many nulls there
are.  This could lead to very poor selectivity estimates, as for example
in a recent report from Andreas Joseph Krogh.  We should discount the
stadistinct value by whatever we've estimated the nulls fraction to be.
(That is what will happen if we choose to use a negative stadistinct for
a column that does have repeated entries, so this code path was just
inconsistent.)

In addition to fixing the stadistinct entries stored by several different
ANALYZE code paths, adjust the logic where get_variable_numdistinct()
forces an "all distinct" estimate on the basis of finding a relevant unique
index.  Unique indexes don't reject nulls, so there's no reason to assume
that the null fraction doesn't apply.

Back-patch to all supported branches.  Back-patching is a bit of a judgment
call, but this problem seems to affect only a few users (else we'd have
identified it long ago), and it's bad enough when it does happen that
destabilizing plan choices in a worse direction seems unlikely.

Patch by me, with documentation wording suggested by Dean Rasheed

Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
Discussion: <16143.1470350371@sss.pgh.pa.us>
1 parent 71dca40 · commit cb5c149

6 files changed: +34 −21 lines changed

doc/src/sgml/catalogs.sgml

Lines changed: 3 additions & 3 deletions
@@ -5890,9 +5890,9 @@
       <entry>The number of distinct nonnull data values in the column.
       A value greater than zero is the actual number of distinct values.
       A value less than zero is the negative of a multiplier for the number
-      of rows in the table; for example, a column in which values appear about
-      twice on the average could be represented by
-      <structfield>stadistinct</> = -0.5.
+      of rows in the table; for example, a column in which about 80% of the
+      values are nonnull and each nonnull value appears about twice on
+      average could be represented by <structfield>stadistinct</> = -0.4.
       A zero value means the number of distinct values is unknown.
       </entry>
      </row>

src/backend/commands/analyze.c

Lines changed: 11 additions & 5 deletions
@@ -2043,8 +2043,11 @@ compute_distinct_stats(VacAttrStatsP stats,

 		if (nmultiple == 0)
 		{
-			/* If we found no repeated values, assume it's a unique column */
-			stats->stadistinct = -1.0;
+			/*
+			 * If we found no repeated non-null values, assume it's a unique
+			 * column; but be sure to discount for any nulls we found.
+			 */
+			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 		}
 		else if (track_cnt < track_max && toowide_cnt == 0 &&
 				 nmultiple == track_cnt)
@@ -2398,8 +2401,11 @@ compute_scalar_stats(VacAttrStatsP stats,

 		if (nmultiple == 0)
 		{
-			/* If we found no repeated values, assume it's a unique column */
-			stats->stadistinct = -1.0;
+			/*
+			 * If we found no repeated non-null values, assume it's a unique
+			 * column; but be sure to discount for any nulls we found.
+			 */
+			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 		}
 		else if (toowide_cnt == 0 && nmultiple == ndistinct)
 		{
@@ -2703,7 +2709,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 		else
 			stats->stawidth = stats->attrtype->typlen;
 		/* Assume all too-wide values are distinct, so it's a unique column */
-		stats->stadistinct = -1.0;
+		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 	}
 	else if (null_cnt > 0)
 	{

src/backend/tsearch/ts_typanalyze.c

Lines changed: 1 addition & 1 deletion
@@ -295,7 +295,7 @@ compute_tsvector_stats(VacAttrStats *stats,
 		stats->stawidth = total_width / (double) nonnull_cnt;

 		/* Assume it's a unique column (see notes above) */
-		stats->stadistinct = -1.0;
+		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);

 		/*
 		 * Construct an array of the interesting hashtable items, that is,

src/backend/utils/adt/rangetypes_typanalyze.c

Lines changed: 3 additions & 1 deletion
@@ -203,7 +203,9 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 		/* Do the simple null-frac and width stats */
 		stats->stanullfrac = (double) null_cnt / (double) samplerows;
 		stats->stawidth = total_width / (double) non_null_cnt;
-		stats->stadistinct = -1.0;
+
+		/* Estimate that non-null values are unique */
+		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);

 		/* Must copy the target values into anl_context */
 		old_cxt = MemoryContextSwitchTo(stats->anl_context);

src/backend/utils/adt/selfuncs.c

Lines changed: 8 additions & 4 deletions
@@ -4684,14 +4684,16 @@ double
 get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 {
 	double		stadistinct;
+	double		stanullfrac = 0.0;
 	double		ntuples;

 	*isdefault = false;

 	/*
 	 * Determine the stadistinct value to use.  There are cases where we can
 	 * get an estimate even without a pg_statistic entry, or can get a better
-	 * value than is in pg_statistic.
+	 * value than is in pg_statistic.  Grab stanullfrac too if we can find it
+	 * (otherwise, assume no nulls, for lack of any better idea).
 	 */
 	if (HeapTupleIsValid(vardata->statsTuple))
 	{
@@ -4700,6 +4702,7 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)

 		stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
 		stadistinct = stats->stadistinct;
+		stanullfrac = stats->stanullfrac;
 	}
 	else if (vardata->vartype == BOOLOID)
 	{
@@ -4723,7 +4726,7 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 		{
 			case ObjectIdAttributeNumber:
 			case SelfItemPointerAttributeNumber:
-				stadistinct = -1.0;		/* unique */
+				stadistinct = -1.0;		/* unique (and all non null) */
 				break;
 			case TableOidAttributeNumber:
 				stadistinct = 1.0;		/* only 1 value */
@@ -4745,10 +4748,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 	 * If there is a unique index or DISTINCT clause for the variable, assume
 	 * it is unique no matter what pg_statistic says; the statistics could be
 	 * out of date, or we might have found a partial unique index that proves
-	 * the var is unique for this query.
+	 * the var is unique for this query.  However, we'd better still believe
+	 * the null-fraction statistic.
 	 */
 	if (vardata->isunique)
-		stadistinct = -1.0;
+		stadistinct = -1.0 * (1.0 - stanullfrac);

 	/*
 	 * If we had an absolute estimate, use that.

src/include/catalog/pg_statistic.h

Lines changed: 8 additions & 7 deletions
@@ -57,13 +57,14 @@ CATALOG(pg_statistic,2619) BKI_WITHOUT_OIDS
 	 *		> 0		actual number of distinct values
 	 *		< 0		negative of multiplier for number of rows
 	 * The special negative case allows us to cope with columns that are
-	 * unique (stadistinct = -1) or nearly so (for example, a column in
-	 * which values appear about twice on the average could be represented
-	 * by stadistinct = -0.5).  Because the number-of-rows statistic in
-	 * pg_class may be updated more frequently than pg_statistic is, it's
-	 * important to be able to describe such situations as a multiple of
-	 * the number of rows, rather than a fixed number of distinct values.
-	 * But in other cases a fixed number is correct (eg, a boolean column).
+	 * unique (stadistinct = -1) or nearly so (for example, a column in which
+	 * non-null values appear about twice on the average could be represented
+	 * by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
+	 * column is nulls).  Because the number-of-rows statistic in pg_class may
+	 * be updated more frequently than pg_statistic is, it's important to be
+	 * able to describe such situations as a multiple of the number of rows,
+	 * rather than a fixed number of distinct values.  But in other cases a
+	 * fixed number is correct (eg, a boolean column).
 	 * ----------------
 	 */
 	float4		stadistinct;
