Commit 95bee94
Fix misestimation of n_distinct for a nearly-unique column with many nulls.
If ANALYZE found no repeated non-null entries in its sample, it set the
column's stadistinct value to -1.0, intending to indicate that the entries
are all distinct. But what this value actually means is that the number
of distinct values is 100% of the table's rowcount, and thus it was
overestimating the number of distinct values by however many nulls there
are. This could lead to very poor selectivity estimates, as for example
in a recent report from Andreas Joseph Krogh. We should discount the
stadistinct value by whatever we've estimated the nulls fraction to be.
(That is what will happen if we choose to use a negative stadistinct for
a column that does have repeated entries, so this code path was just
inconsistent.)

In addition to fixing the stadistinct entries stored by several different
ANALYZE code paths, adjust the logic where get_variable_numdistinct()
forces an "all distinct" estimate on the basis of finding a relevant unique
index. Unique indexes don't reject nulls, so there's no reason to assume
that the null fraction doesn't apply.

Back-patch to all supported branches. Back-patching is a bit of a judgment
call, but this problem seems to affect only a few users (else we'd have
identified it long ago), and it's bad enough when it does happen that
destabilizing plan choices in a worse direction seems unlikely.

Patch by me, with documentation wording suggested by Dean Rasheed

Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
Discussion: <16143.1470350371@sss.pgh.pa.us>
1 parent 8a8c6b5 · commit 95bee94

File tree

6 files changed: +34 −21 lines changed

‎doc/src/sgml/catalogs.sgml

Lines changed: 3 additions & 3 deletions

@@ -5849,9 +5849,9 @@
       <entry>The number of distinct nonnull data values in the column.
       A value greater than zero is the actual number of distinct values.
       A value less than zero is the negative of a multiplier for the number
-      of rows in the table; for example, a column in which values appear about
-      twice on the average could be represented by
-      <structfield>stadistinct</> = -0.5.
+      of rows in the table; for example, a column in which about 80% of the
+      values are nonnull and each nonnull value appears about twice on
+      average could be represented by <structfield>stadistinct</> = -0.4.
       A zero value means the number of distinct values is unknown.
       </entry>
      </row>

‎src/backend/commands/analyze.c

Lines changed: 11 additions & 5 deletions

@@ -2049,8 +2049,11 @@ compute_distinct_stats(VacAttrStatsP stats,

 		if (nmultiple == 0)
 		{
-			/* If we found no repeated values, assume it's a unique column */
-			stats->stadistinct = -1.0;
+			/*
+			 * If we found no repeated non-null values, assume it's a unique
+			 * column; but be sure to discount for any nulls we found.
+			 */
+			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 		}
 		else if (track_cnt < track_max && toowide_cnt == 0 &&
 				 nmultiple == track_cnt)
@@ -2426,8 +2429,11 @@ compute_scalar_stats(VacAttrStatsP stats,

 		if (nmultiple == 0)
 		{
-			/* If we found no repeated values, assume it's a unique column */
-			stats->stadistinct = -1.0;
+			/*
+			 * If we found no repeated non-null values, assume it's a unique
+			 * column; but be sure to discount for any nulls we found.
+			 */
+			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 		}
 		else if (toowide_cnt == 0 && nmultiple == ndistinct)
 		{
@@ -2753,7 +2759,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 		else
 			stats->stawidth = stats->attrtype->typlen;
 		/* Assume all too-wide values are distinct, so it's a unique column */
-		stats->stadistinct = -1.0;
+		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 	}
 	else if (null_cnt > 0)
 	{

‎src/backend/tsearch/ts_typanalyze.c

Lines changed: 1 addition & 1 deletion

@@ -295,7 +295,7 @@ compute_tsvector_stats(VacAttrStats *stats,
 		stats->stawidth = total_width / (double) nonnull_cnt;

 		/* Assume it's a unique column (see notes above) */
-		stats->stadistinct = -1.0;
+		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);

 		/*
 		 * Construct an array of the interesting hashtable items, that is,

‎src/backend/utils/adt/rangetypes_typanalyze.c

Lines changed: 3 additions & 1 deletion

@@ -203,7 +203,9 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 		/* Do the simple null-frac and width stats */
 		stats->stanullfrac = (double) null_cnt / (double) samplerows;
 		stats->stawidth = total_width / (double) non_null_cnt;
-		stats->stadistinct = -1.0;
+
+		/* Estimate that non-null values are unique */
+		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);

 		/* Must copy the target values into anl_context */
 		old_cxt = MemoryContextSwitchTo(stats->anl_context);

‎src/backend/utils/adt/selfuncs.c

Lines changed: 8 additions & 4 deletions

@@ -4738,14 +4738,16 @@ double
 get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 {
 	double		stadistinct;
+	double		stanullfrac = 0.0;
 	double		ntuples;

 	*isdefault = false;

 	/*
 	 * Determine the stadistinct value to use.  There are cases where we can
 	 * get an estimate even without a pg_statistic entry, or can get a better
-	 * value than is in pg_statistic.
+	 * value than is in pg_statistic.  Grab stanullfrac too if we can find it
+	 * (otherwise, assume no nulls, for lack of any better idea).
 	 */
 	if (HeapTupleIsValid(vardata->statsTuple))
 	{
@@ -4754,6 +4756,7 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)

 		stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
 		stadistinct = stats->stadistinct;
+		stanullfrac = stats->stanullfrac;
 	}
 	else if (vardata->vartype == BOOLOID)
 	{
@@ -4777,7 +4780,7 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 		{
 			case ObjectIdAttributeNumber:
 			case SelfItemPointerAttributeNumber:
-				stadistinct = -1.0;		/* unique */
+				stadistinct = -1.0;		/* unique (and all non null) */
 				break;
 			case TableOidAttributeNumber:
 				stadistinct = 1.0;		/* only 1 value */
@@ -4799,10 +4802,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 	 * If there is a unique index or DISTINCT clause for the variable, assume
 	 * it is unique no matter what pg_statistic says; the statistics could be
 	 * out of date, or we might have found a partial unique index that proves
-	 * the var is unique for this query.
+	 * the var is unique for this query.  However, we'd better still believe
+	 * the null-fraction statistic.
 	 */
 	if (vardata->isunique)
-		stadistinct = -1.0;
+		stadistinct = -1.0 * (1.0 - stanullfrac);

 	/*
 	 * If we had an absolute estimate, use that.

‎src/include/catalog/pg_statistic.h

Lines changed: 8 additions & 7 deletions

@@ -57,13 +57,14 @@ CATALOG(pg_statistic,2619) BKI_WITHOUT_OIDS
 	 *		> 0		actual number of distinct values
 	 *		< 0		negative of multiplier for number of rows
 	 * The special negative case allows us to cope with columns that are
-	 * unique (stadistinct = -1) or nearly so (for example, a column in
-	 * which values appear about twice on the average could be represented
-	 * by stadistinct = -0.5).  Because the number-of-rows statistic in
-	 * pg_class may be updated more frequently than pg_statistic is, it's
-	 * important to be able to describe such situations as a multiple of
-	 * the number of rows, rather than a fixed number of distinct values.
-	 * But in other cases a fixed number is correct (eg, a boolean column).
+	 * unique (stadistinct = -1) or nearly so (for example, a column in which
+	 * non-null values appear about twice on the average could be represented
+	 * by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
+	 * column is nulls).  Because the number-of-rows statistic in pg_class may
+	 * be updated more frequently than pg_statistic is, it's important to be
+	 * able to describe such situations as a multiple of the number of rows,
+	 * rather than a fixed number of distinct values.  But in other cases a
+	 * fixed number is correct (eg, a boolean column).
 	 * ----------------
 	 */
 	float4		stadistinct;
