PostgreSQL 9.4.1 Documentation | |||
---|---|---|---|
Prev | Up | Chapter 61. How the Planner Uses Statistics | Next |
61.1. Row Estimation Examples
The examples shown below use tables in thePostgreSQL regression test database. The outputs shown are taken from version 8.3. The behavior of earlier (or later) versions might vary. Note also that sinceANALYZE uses random sampling while producing statistics, the results will change slightly after any newANALYZE.
Let's start with a very simple query:
EXPLAIN SELECT * FROM tenk1; QUERY PLAN------------------------------------------------------------- Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
How the planner determines the cardinality oftenk1 is covered inSection 14.2, but is repeated here for completeness. The number of pages and rows is looked up inpg_class:
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1'; relpages | reltuples----------+----------- 358 | 10000
These numbers are current as of the lastVACUUM orANALYZE on the table. The planner then fetches the actual current number of pages in the table (this is a cheap operation, not requiring a table scan). If that is different fromrelpages thenreltuples is scaled accordingly to arrive at a current number-of-rows estimate. In this case the value ofrelpages is up-to-date so the rows estimate is the same asreltuples.
Let's move on to an example with a range condition in itsWHERE clause:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000; QUERY PLAN-------------------------------------------------------------------------------- Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244) Recheck Cond: (unique1 < 1000) -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0) Index Cond: (unique1 < 1000)
The planner examines theWHERE clause condition and looks up the selectivity function for the operator< inpg_operator. This is held in the columnoprrest, and the entry in this case isscalarltsel
. Thescalarltsel
function retrieves the histogram forunique1 frompg_statistics. For manual queries it is more convenient to look in the simplerpg_stats view:
SELECT histogram_bounds FROM pg_statsWHERE tablename='tenk1' AND attname='unique1'; histogram_bounds------------------------------------------------------ {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
Next the fraction of the histogram occupied by"< 1000" is worked out. This is the selectivity. The histogram divides the range into equal frequency buckets, so all we have to do is locate the bucket that our value is in and countpart of it andall of the ones before. The value 1000 is clearly in the second bucket (993-1997). Assuming a linear distribution of values inside each bucket, we can calculate the selectivity as:
selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets = (1 + (1000 - 993)/(1997 - 993))/10 = 0.100697
that is, one whole bucket plus a linear fraction of the second, divided by the number of buckets. The estimated number of rows can now be calculated as the product of the selectivity and the cardinality oftenk1:
rows = rel_cardinality * selectivity = 10000 * 0.100697 = 1007 (rounding off)
Next let's consider an example with an equality condition in itsWHERE clause:
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA'; QUERY PLAN---------------------------------------------------------- Seq Scan on tenk1 (cost=0.00..483.00 rows=30 width=244) Filter: (stringu1 = 'CRAAAA'::name)
Again the planner examines theWHERE clause condition and looks up the selectivity function for=, which is SinceCRAAAA appears in the list of MCVs, the selectivity is merely the corresponding entry in the list of most common frequencies (MCFs): As before, the estimated number of rows is just the product of this with the cardinality oftenk1: Now consider the same query, but with a constant that is not in theMCV list: This is quite a different problem: how to estimate the selectivity when the value isnot in theMCV list. The approach is to use the fact that the value is not in the list, combined with the knowledge of the frequencies for all of theMCVs:eqsel
. For equality estimation the histogram is not useful; instead the list ofmost common values (MCVs) is used to determine the selectivity. Let's have a look at the MCVs, with some additional columns that will be useful later:SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_statsWHERE tablename='tenk1' AND attname='stringu1';null_frac | 0n_distinct | 676most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
selectivity = mcf[3] = 0.003
rows = 10000 * 0.003 = 30
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx'; QUERY PLAN---------------------------------------------------------- Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244) Filter: (stringu1 = 'xxx'::name)
selectivity = (1 - sum(mvf))/(num_distinct - num_mcv) = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003))/(676 - 10) = 0.0014559