NotificationsYou must be signed in to change notification settings
Fork6
Star31

Commitb5db1d9

committed

Improve ANALYZE's strategy for finding MCVs.

Previously, a value was included in the MCV list if its frequency was25% larger than the estimated average frequency of all nonnull valuesin the table. For uniform distributions, that can lead to valuesbeing included in the MCV list and significantly overestimated on thebasis of relatively few (sometimes just 2) instances being seen in thesample. For non-uniform distributions, it can lead to too few valuesbeing included in the MCV list, since the overall average frequencymay be dominated by a small number of very common values, while theremaining values may still have a large spread of frequencies, causingboth substantial overestimation and underestimation of the remainingvalues. Furthermore, increasing the statistics target may have littleeffect because the overall average frequency will remain relativelyunchanged.Instead, populate the MCV list with the largest set of common valuesthat are statistically significantly more common than the averagefrequency of the remaining values. This takes into account thevariance of the sample counts, which depends on the counts themselvesand on the proportion of the table that was sampled. As a result, itconstrains the relative standard error of estimates based on thefrequencies of values in the list, reducing the chances of too manyvalues being included. At the same time, it allows more values to beincluded, since the MCVs need only be more common than the remainingnon-MCVs, rather than the overall average. Thus it tends to producefewer MCVs than the previous code for uniform distributions, and morefor non-uniform distributions, reducing estimation errors in bothcases. In addition, the algorithm responds better to increasing thestatistics target, allowing more values to be included in the MCV listwhen more of the table is sampled.Jeff Janes, substantially modified by me. Reviewed by John Naylor andTomas Vondra.Discussion:https://postgr.es/m/CAMkU=1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO+a=4DVkzpuQ2tw@mail.gmail.com

1 parent31bc604 commitb5db1d9Copy full SHA for b5db1d9

File tree

1 file changed

+158

-58

lines changed

src/backend/commands
- analyze.c

1 file changed

+158

-58

lines changed

`‎src/backend/commands/analyze.c‎`

Lines changed: 158 additions & 58 deletions

Original file line number	Diff line number	Diff line change
`@@ -1769,6 +1769,12 @@ static void compute_scalar_stats(VacAttrStatsP stats,`
`1769`	`1769`	`doubletotalrows);`
`1770`	`1770`	`staticintcompare_scalars(constvoida,constvoidb,void*arg);`
`1771`	`1771`	`staticintcompare_mcvs(constvoida,constvoidb);`
	`1772`	`+staticintanalyze_mcv_list(int*mcv_counts,`
	`1773`	`+intnum_mcv,`
	`1774`	`+doublestadistinct,`
	`1775`	`+doublestanullfrac,`
	`1776`	`+intsamplerows,`
	`1777`	`+doubletotalrows);`
`1772`	`1778`
`1773`	`1779`
`1774`	`1780`	`/*`
`@@ -2187,9 +2193,7 @@ compute_distinct_stats(VacAttrStatsP stats,`
`2187`	`2193`	`* we are able to generate a complete MCV list (all the values in the`
`2188`	`2194`	`* sample will fit, and we think these are all the ones in the table),`
`2189`	`2195`	`* then do so. Otherwise, store only those values that are`
`2190`		`- * significantly more common than the (estimated) average. We set the`
`2191`		`- * threshold rather arbitrarily at 25% more than average, with at`
`2192`		`- * least 2 instances in the sample.`
	`2196`	`+ * significantly more common than the values not in the list.`
`2193`	`2197`	`*`
`2194`	`2198`	`* Note: the first of these cases is meant to address columns with`
`2195`	`2199`	`* small, fixed sets of possible values, such as boolean or enum`
`@@ -2198,8 +2202,7 @@ compute_distinct_stats(VacAttrStatsP stats,`
`2198`	`2202`	`* so and thus provide the planner with complete information. But if`
`2199`	`2203`	`* the MCV list is not complete, it's generally worth being more`
`2200`	`2204`	`* selective, and not just filling it all the way up to the stats`
`2201`		`- * target. So for an incomplete list, we try to take only MCVs that`
`2202`		`- * are significantly more common than average.`
	`2205`	`+ * target.`
`2203`	`2206`	`*/`
`2204`	`2207`	`if (track_cnt<track_max&&toowide_cnt==0&&`
`2205`	`2208`	`stats->stadistinct>0&&`
`@@ -2210,28 +2213,22 @@ compute_distinct_stats(VacAttrStatsP stats,`
`2210`	`2213`	`}`
`2211`	`2214`	`else`
`2212`	`2215`	`{`
`2213`		`-doublendistinct_table=stats->stadistinct;`
`2214`		`-doubleavgcount,`
`2215`		`-mincount;`
`2216`		`-`
`2217`		`-/* Re-extract estimate of # distinct nonnull values in table */`
`2218`		`-if (ndistinct_table<0)`
`2219`		`-ndistinct_table=-ndistinct_table*totalrows;`
`2220`		`-/* estimate # occurrences in sample of a typical nonnull value */`
`2221`		`-avgcount= (double)nonnull_cnt /ndistinct_table;`
`2222`		`-/* set minimum threshold count to store a value */`
`2223`		`-mincount=avgcount*1.25;`
`2224`		`-if (mincount<2)`
`2225`		`-mincount=2;`
	`2216`	`+int*mcv_counts;`
	`2217`	`+`
	`2218`	`+/* Incomplete list; decide how many values are worth keeping */`
`2226`	`2219`	`if (num_mcv>track_cnt)`
`2227`	`2220`	`num_mcv=track_cnt;`
`2228`		`-for (i=0;i<num_mcv;i++)`
	`2221`	`+`
	`2222`	`+if (num_mcv>0)`
`2229`	`2223`	`{`
`2230`		`-if (track[i].count<mincount)`
`2231`		`-{`
`2232`		`-num_mcv=i;`
`2233`		`-break;`
`2234`		`-}`
	`2224`	`+mcv_counts= (int)palloc(num_mcvsizeof(int));`
	`2225`	`+for (i=0;i<num_mcv;i++)`
	`2226`	`+mcv_counts[i]=track[i].count;`
	`2227`	`+`
	`2228`	`+num_mcv=analyze_mcv_list(mcv_counts,num_mcv,`
	`2229`	`+stats->stadistinct,`
	`2230`	`+stats->stanullfrac,`
	`2231`	`+samplerows,totalrows);`
`2235`	`2232`	`}`
`2236`	`2233`	`}`
`2237`	`2234`
`@@ -2561,14 +2558,7 @@ compute_scalar_stats(VacAttrStatsP stats,`
`2561`	`2558`	`* we are able to generate a complete MCV list (all the values in the`
`2562`	`2559`	`* sample will fit, and we think these are all the ones in the table),`
`2563`	`2560`	`* then do so. Otherwise, store only those values that are`
`2564`		`- * significantly more common than the (estimated) average. We set the`
`2565`		`- * threshold rather arbitrarily at 25% more than average, with at`
`2566`		`- * least 2 instances in the sample. Also, we won't suppress values`
`2567`		`- * that have a frequency of at least 1/K where K is the intended`
`2568`		`- * number of histogram bins; such values might otherwise cause us to`
`2569`		`- * emit duplicate histogram bin boundaries. (We might end up with`
`2570`		`- * duplicate histogram entries anyway, if the distribution is skewed;`
`2571`		`- * but we prefer to treat such values as MCVs if at all possible.)`
	`2561`	`+ * significantly more common than the values not in the list.`
`2572`	`2562`	`*`
`2573`	`2563`	`* Note: the first of these cases is meant to address columns with`
`2574`	`2564`	`* small, fixed sets of possible values, such as boolean or enum`
`@@ -2577,8 +2567,7 @@ compute_scalar_stats(VacAttrStatsP stats,`
`2577`	`2567`	`* so and thus provide the planner with complete information. But if`
`2578`	`2568`	`* the MCV list is not complete, it's generally worth being more`
`2579`	`2569`	`* selective, and not just filling it all the way up to the stats`
`2580`		`- * target. So for an incomplete list, we try to take only MCVs that`
`2581`		`- * are significantly more common than average.`
	`2570`	`+ * target.`
`2582`	`2571`	`*/`
`2583`	`2572`	`if (track_cnt==ndistinct&&toowide_cnt==0&&`
`2584`	`2573`	`stats->stadistinct>0&&`
`@@ -2589,33 +2578,22 @@ compute_scalar_stats(VacAttrStatsP stats,`
`2589`	`2578`	`}`
`2590`	`2579`	`else`
`2591`	`2580`	`{`
`2592`		`-doublendistinct_table=stats->stadistinct;`
`2593`		`-doubleavgcount,`
`2594`		`-mincount,`
`2595`		`-maxmincount;`
`2596`		`-`
`2597`		`-/* Re-extract estimate of # distinct nonnull values in table */`
`2598`		`-if (ndistinct_table<0)`
`2599`		`-ndistinct_table=-ndistinct_table*totalrows;`
`2600`		`-/* estimate # occurrences in sample of a typical nonnull value */`
`2601`		`-avgcount= (double)nonnull_cnt /ndistinct_table;`
`2602`		`-/* set minimum threshold count to store a value */`
`2603`		`-mincount=avgcount*1.25;`
`2604`		`-if (mincount<2)`
`2605`		`-mincount=2;`
`2606`		`-/* don't let threshold exceed 1/K, however */`
`2607`		`-maxmincount= (double)values_cnt / (double)num_bins;`
`2608`		`-if (mincount>maxmincount)`
`2609`		`-mincount=maxmincount;`
	`2581`	`+int*mcv_counts;`
	`2582`	`+`
	`2583`	`+/* Incomplete list; decide how many values are worth keeping */`
`2610`	`2584`	`if (num_mcv>track_cnt)`
`2611`	`2585`	`num_mcv=track_cnt;`
`2612`		`-for (i=0;i<num_mcv;i++)`
	`2586`	`+`
	`2587`	`+if (num_mcv>0)`
`2613`	`2588`	`{`
`2614`		`-if (track[i].count<mincount)`
`2615`		`-{`
`2616`		`-num_mcv=i;`
`2617`		`-break;`
`2618`		`-}`
	`2589`	`+mcv_counts= (int)palloc(num_mcvsizeof(int));`
	`2590`	`+for (i=0;i<num_mcv;i++)`
	`2591`	`+mcv_counts[i]=track[i].count;`
	`2592`	`+`
	`2593`	`+num_mcv=analyze_mcv_list(mcv_counts,num_mcv,`
	`2594`	`+stats->stadistinct,`
	`2595`	`+stats->stanullfrac,`
	`2596`	`+samplerows,totalrows);`
`2619`	`2597`	`}`
`2620`	`2598`	`}`
`2621`	`2599`
`@@ -2881,3 +2859,125 @@ compare_mcvs(const void a, const void b)`
`2881`	`2859`
`2882`	`2860`	`returnda-db;`
`2883`	`2861`	`}`
	`2862`	`+`
	`2863`	`+/*`
	`2864`	`+ * Analyze the list of common values in the sample and decide how many are`
	`2865`	`+ * worth storing in the table's MCV list.`
	`2866`	`+ *`
	`2867`	`+ * mcv_counts is assumed to be a list of the counts of the most common values`
	`2868`	`+ * seen in the sample, starting with the most common. The return value is the`
	`2869`	`+ * number that are significantly more common than the values not in the list,`
	`2870`	`+ * and which are therefore deemed worth storing in the table's MCV list.`
	`2871`	`+ */`
	`2872`	`+staticint`
	`2873`	`+analyze_mcv_list(int*mcv_counts,`
	`2874`	`+intnum_mcv,`
	`2875`	`+doublestadistinct,`
	`2876`	`+doublestanullfrac,`
	`2877`	`+intsamplerows,`
	`2878`	`+doubletotalrows)`
	`2879`	`+{`
	`2880`	`+doublendistinct_table;`
	`2881`	`+doublesumcount;`
	`2882`	`+inti;`
	`2883`	`+`
	`2884`	`+/*`
	`2885`	`+ * If the entire table was sampled, keep the whole list. This also`
	`2886`	`+ * protects us against division by zero in the code below.`
	`2887`	`+ */`
	`2888`	`+if (samplerows==totalrows\|\|totalrows <=1.0)`
	`2889`	`+returnnum_mcv;`
	`2890`	`+`
	`2891`	`+/* Re-extract the estimated number of distinct nonnull values in table */`
	`2892`	`+ndistinct_table=stadistinct;`
	`2893`	`+if (ndistinct_table<0)`
	`2894`	`+ndistinct_table=-ndistinct_table*totalrows;`
	`2895`	`+`
	`2896`	`+/*`
	`2897`	`+ * Exclude the least common values from the MCV list, if they are not`
	`2898`	`+ * significantly more common than the estimated selectivity they would`
	`2899`	`+ * have if they weren't in the list. All non-MCV values are assumed to be`
	`2900`	`+ * equally common, after taking into account the frequencies of all the`
	`2901`	`+ * the values in the MCV list and the number of nulls (c.f. eqsel()).`
	`2902`	`+ *`
	`2903`	`+ * Here sumcount tracks the total count of all but the last (least common)`
	`2904`	`+ * value in the MCV list, allowing us to determine the effect of excluding`
	`2905`	`+ * that value from the list.`
	`2906`	`+ *`
	`2907`	`+ * Note that we deliberately do this by removing values from the full`
	`2908`	`+ * list, rather than starting with an empty list and adding values,`
	`2909`	`+ * because the latter approach can fail to add any values if all the most`
	`2910`	`+ * common values have around the same frequency and make up the majority`
	`2911`	`+ * of the table, so that the overall average frequency of all values is`
	`2912`	`+ * roughly the same as that of the common values. This would lead to any`
	`2913`	`+ * uncommon values being significantly overestimated.`
	`2914`	`+ */`
	`2915`	`+sumcount=0.0;`
	`2916`	`+for (i=0;i<num_mcv-1;i++)`
	`2917`	`+sumcount+=mcv_counts[i];`
	`2918`	`+`
	`2919`	`+while (num_mcv>0)`
	`2920`	`+{`
	`2921`	`+doubleselec,`
	`2922`	`+otherdistinct,`
	`2923`	`+N,`
	`2924`	`+n,`
	`2925`	`+K,`
	`2926`	`+variance,`
	`2927`	`+stddev;`
	`2928`	`+`
	`2929`	`+/*`
	`2930`	`+ * Estimated selectivity the least common value would have if it`
	`2931`	`+ * wasn't in the MCV list (c.f. eqsel()).`
	`2932`	`+ */`
	`2933`	`+selec=1.0-sumcount /samplerows-stanullfrac;`
	`2934`	`+if (selec<0.0)`
	`2935`	`+selec=0.0;`
	`2936`	`+if (selec>1.0)`
	`2937`	`+selec=1.0;`
	`2938`	`+otherdistinct=ndistinct_table- (num_mcv-1);`
	`2939`	`+if (otherdistinct>1)`
	`2940`	`+selec /=otherdistinct;`
	`2941`	`+`
	`2942`	`+/*`
	`2943`	`+ * If the value is kept in the MCV list, its population frequency is`
	`2944`	`+ * assumed to equal its sample frequency. We use the lower end of a`
	`2945`	`+ * textbook continuity-corrected Wald-type confidence interval to`
	`2946`	`+ * determine if that is significantly more common than the non-MCV`
	`2947`	`+ * frequency --- specifically we assume the population frequency is`
	`2948`	`+ * highly likely to be within around 2 standard errors of the sample`
	`2949`	`+ * frequency, which equates to an interval of 2 standard deviations`
	`2950`	`+ * either side of the sample count, plus an additional 0.5 for the`
	`2951`	`+ * continuity correction. Since we are sampling without replacement,`
	`2952`	`+ * this is a hypergeometric distribution.`
	`2953`	`+ *`
	`2954`	`+ * XXX: Empirically, this approach seems to work quite well, but it`
	`2955`	`+ * may be worth considering more advanced techniques for estimating`
	`2956`	`+ * the confidence interval of the hypergeometric distribution.`
	`2957`	`+ */`
	`2958`	`+N=totalrows;`
	`2959`	`+n=samplerows;`
	`2960`	`+K=N*mcv_counts[num_mcv-1] /n;`
	`2961`	`+variance=nK (N-K)* (N-n) / (NN (N-1));`
	`2962`	`+stddev=sqrt(variance);`
	`2963`	`+`
	`2964`	`+if (mcv_counts[num_mcv-1]>selecsamplerows+2stddev+0.5)`
	`2965`	`+{`
	`2966`	`+/*`
	`2967`	`+ * The value is significantly more common than the non-MCV`
	`2968`	`+ * selectivity would suggest. Keep it, and all the other more`
	`2969`	`+ * common values in the list.`
	`2970`	`+ */`
	`2971`	`+break;`
	`2972`	`+}`
	`2973`	`+else`
	`2974`	`+{`
	`2975`	`+/* Discard this value and consider the next least common value */`
	`2976`	`+num_mcv--;`
	`2977`	`+if (num_mcv==0)`
	`2978`	`+break;`
	`2979`	`+sumcount-=mcv_counts[num_mcv-1];`
	`2980`	`+}`
	`2981`	`+}`
	`2982`	`+returnnum_mcv;`
	`2983`	`+}`

0 commit comments

Comments

(0)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commitb5db1d9

File tree

1 file changed

1 file changed

`‎src/backend/commands/analyze.c‎`

0 commit comments