Commit 3d3bf62
Omit null rows when setting the threshold for what's a most-common value.
As with the previous patch, large numbers of null rows could skew this calculation unfavorably, causing us to discard values that have a legitimate claim to be MCVs, since our definition of MCV is that it's most common among the non-null population of the column. Hence, make the numerator of avgcount be the number of non-null sample values, not the number of sample rows; likewise for maxmincount in the compute_scalar_stats variant.

Also, make the denominator be the number of distinct values actually observed in the sample, rather than reversing it back out of the computed stadistinct. This avoids depending on the accuracy of the Haas-Stokes approximation, and really it's what we want anyway; the threshold should depend only on what we see in the sample, not on what we extrapolate about the contents of the whole column.

Alex Shulgin, reviewed by Tomas Vondra and myself

1 parent 5cb8826, commit 3d3bf62
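The effect of the numerator change can be illustrated with a small sketch. This is not the patched C code: the function name `mcv_min_count`, the 1.25 multiplier, and the floor of 2 are assumptions chosen for illustration. Only the choice of numerator (non-null sample values) and denominator (distinct values observed in the sample) comes from the commit message.

```python
def mcv_min_count(nonnull_cnt, ndistinct_sample, num_bins=None,
                  factor=1.25, floor=2.0):
    """Hypothetical sketch of a minimum sample count for keeping a value as an MCV.

    nonnull_cnt      -- number of non-null values in the sample
    ndistinct_sample -- number of distinct values actually observed in the sample
    num_bins         -- optional histogram bin count used to cap the threshold
    factor, floor    -- illustrative assumptions, not taken from the commit text
    """
    if nonnull_cnt <= 0 or ndistinct_sample <= 0:
        raise ValueError("need at least one non-null sample value")

    # Average occurrences of a typical non-null value, from the sample alone.
    avgcount = nonnull_cnt / ndistinct_sample

    # A value must be somewhat more common than average to qualify.
    mincount = max(avgcount * factor, floor)

    # Optional cap, also computed from the non-null count rather than all rows.
    if num_bins:
        maxmincount = nonnull_cnt / num_bins
        mincount = min(mincount, maxmincount)

    return mincount


# A sample of 30000 rows, of which 27000 are null, with 100 distinct values.
new_threshold = mcv_min_count(3000, 100)    # numerator: non-null values
old_threshold = mcv_min_count(30000, 100)   # numerator: all sample rows
print(new_threshold, old_threshold)
```

Under these assumed constants, a value seen 200 times among the 3000 non-null sample values clears the threshold computed from the non-null count but would be discarded by the one computed from all sample rows, which is the skew the commit message describes.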
1 file changed: 9 additions & 11 deletions
Hunk 1: original lines 2133-2146, new lines 2133-2145 (5 deletions, 4 additions)
Hunk 2: original lines 2494-2514, new lines 2493-2512 (6 deletions, 5 additions)