Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commitbe4b4dc

Browse files
committed
Omit null rows when applying the Haas-Stokes estimator for ndistinct.
Previously, we included null rows in the values of n and N that wentinto the formula, which amounts to considering null as a value in itsown right; but the d and f1 values do not include nulls. This isinconsistent, and it contributes to significant underestimation ofndistinct when the column is mostly nulls. In any case stadistinctis defined as the number of distinct non-null values, so we shouldexclude nulls when doing this computation.This is an aboriginal bug in our application of the Haas-Stokes formula,but we'll refrain from back-patching for fear of destabilizing planchoices in released branches.While at it, make the code a bit more readable by omitting unnecessarycasts and intermediate variables.Observation and original patch by Tomas Vondra, adjusted to fix bothuses of the formula by Alex Shulgin, cosmetic improvements by me
1 parent82c83b3 commitbe4b4dc

File tree

1 file changed

+38
-24
lines changed

1 file changed

+38
-24
lines changed

‎src/backend/commands/analyze.c

Lines changed: 38 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -2072,6 +2072,12 @@ compute_distinct_stats(VacAttrStatsP stats,
20722072
* recommend are considerably more complex, and are numerically
20732073
* very unstable when n is much smaller than N.
20742074
*
2075+
* In this calculation, we consider only non-nulls. We used to
2076+
* include rows with null values in the n and N counts, but that
2077+
* leads to inaccurate answers in columns with many nulls, and
2078+
* it's intuitively bogus anyway considering the desired result is
2079+
* the number of distinct non-null values.
2080+
*
20752081
* We assume (not very reliably!) that all the multiply-occurring
20762082
* values are reflected in the final track[] list, and the other
20772083
* nonnull values all appeared but once. (XXX this usually
@@ -2081,21 +2087,22 @@ compute_distinct_stats(VacAttrStatsP stats,
20812087
*/
20822088
intf1=nonnull_cnt-summultiple;
20832089
intd=f1+nmultiple;
2084-
doublenumer,
2085-
denom,
2086-
stadistinct;
2087-
2088-
numer= (double)samplerows*(double)d;
2090+
doublen=samplerows-null_cnt;
2091+
doubleN=totalrows* (1.0-stats->stanullfrac);
2092+
doublestadistinct;
20892093

2090-
denom= (double) (samplerows-f1)+
2091-
(double)f1*(double)samplerows /totalrows;
2094+
/* N == 0 shouldn't happen, but just in case ... */
2095+
if (N>0)
2096+
stadistinct= (n*d) / ((n-f1)+f1*n /N);
2097+
else
2098+
stadistinct=0;
20922099

2093-
stadistinct=numer /denom;
20942100
/* Clamp to sane range in case of roundoff error */
2095-
if (stadistinct< (double)d)
2096-
stadistinct= (double)d;
2097-
if (stadistinct>totalrows)
2098-
stadistinct=totalrows;
2101+
if (stadistinct<d)
2102+
stadistinct=d;
2103+
if (stadistinct>N)
2104+
stadistinct=N;
2105+
/* And round to integer */
20992106
stats->stadistinct=floor(stadistinct+0.5);
21002107
}
21012108

@@ -2425,26 +2432,33 @@ compute_scalar_stats(VacAttrStatsP stats,
24252432
* recommend are considerably more complex, and are numerically
24262433
* very unstable when n is much smaller than N.
24272434
*
2435+
* In this calculation, we consider only non-nulls. We used to
2436+
* include rows with null values in the n and N counts, but that
2437+
* leads to inaccurate answers in columns with many nulls, and
2438+
* it's intuitively bogus anyway considering the desired result is
2439+
* the number of distinct non-null values.
2440+
*
24282441
* Overwidth values are assumed to have been distinct.
24292442
*----------
24302443
*/
24312444
intf1=ndistinct-nmultiple+toowide_cnt;
24322445
intd=f1+nmultiple;
2433-
doublenumer,
2434-
denom,
2435-
stadistinct;
2436-
2437-
numer= (double)samplerows*(double)d;
2446+
doublen=samplerows-null_cnt;
2447+
doubleN=totalrows* (1.0-stats->stanullfrac);
2448+
doublestadistinct;
24382449

2439-
denom= (double) (samplerows-f1)+
2440-
(double)f1*(double)samplerows /totalrows;
2450+
/* N == 0 shouldn't happen, but just in case ... */
2451+
if (N>0)
2452+
stadistinct= (n*d) / ((n-f1)+f1*n /N);
2453+
else
2454+
stadistinct=0;
24412455

2442-
stadistinct=numer /denom;
24432456
/* Clamp to sane range in case of roundoff error */
2444-
if (stadistinct< (double)d)
2445-
stadistinct= (double)d;
2446-
if (stadistinct>totalrows)
2447-
stadistinct=totalrows;
2457+
if (stadistinct<d)
2458+
stadistinct=d;
2459+
if (stadistinct>N)
2460+
stadistinct=N;
2461+
/* And round to integer */
24482462
stats->stadistinct=floor(stadistinct+0.5);
24492463
}
24502464

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp