68.1. Row Estimation Examples#
The examples shown below use tables in thePostgreSQL regression test database. Note also that sinceANALYZE
uses random sampling while producing statistics, the results will change slightly after any newANALYZE
.
Let's start with a very simple query:
EXPLAIN SELECT * FROM tenk1; QUERY PLAN------------------------------------------------------------- Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
How the planner determines the cardinality oftenk1
is covered inSection 14.2, but is repeated here for completeness. The number of pages and rows is looked up inpg_class
:
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1'; relpages | reltuples----------+----------- 358 | 10000
These numbers are current as of the lastVACUUM
orANALYZE
on the table. The planner then fetches the actual current number of pages in the table (this is a cheap operation, not requiring a table scan). If that is different fromrelpages
thenreltuples
is scaled accordingly to arrive at a current number-of-rows estimate. In the example above, the value ofrelpages
is up-to-date so the rows estimate is the same asreltuples
.
Let's move on to an example with a range condition in itsWHERE
clause:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000; QUERY PLAN-------------------------------------------------------------------------------- Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244) Recheck Cond: (unique1 < 1000) -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0) Index Cond: (unique1 < 1000)
The planner examines theWHERE
clause condition and looks up the selectivity function for the operator<
inpg_operator
. This is held in the columnoprrest
, and the entry in this case isscalarltsel
. Thescalarltsel
function retrieves the histogram forunique1
frompg_statistic
. For manual queries it is more convenient to look in the simplerpg_stats
view:
SELECT histogram_bounds FROM pg_statsWHERE tablename='tenk1' AND attname='unique1'; histogram_bounds------------------------------------------------------ {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
Next the fraction of the histogram occupied by“< 1000” is worked out. This is the selectivity. The histogram divides the range into equal frequency buckets, so all we have to do is locate the bucket that our value is in and countpart of it andall of the ones before. The value 1000 is clearly in the second bucket (993–1997). Assuming a linear distribution of values inside each bucket, we can calculate the selectivity as:
selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets = (1 + (1000 - 993)/(1997 - 993))/10 = 0.100697
that is, one whole bucket plus a linear fraction of the second, divided by the number of buckets. The estimated number of rows can now be calculated as the product of the selectivity and the cardinality oftenk1
:
rows = rel_cardinality * selectivity = 10000 * 0.100697 = 1007 (rounding off)
Next let's consider an example with an equality condition in itsWHERE
clause:
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA'; QUERY PLAN---------------------------------------------------------- Seq Scan on tenk1 (cost=0.00..483.00 rows=30 width=244) Filter: (stringu1 = 'CRAAAA'::name)
Again the planner examines the Since As before, the estimated number of rows is just the product of this with the cardinality of Now consider the same query, but with a constant that is not in theMCV list: This is quite a different problem: how to estimate the selectivity when the value isnot in theMCV list. The approach is to use the fact that the value is not in the list, combined with the knowledge of the frequencies for all of theMCVs:WHERE
clause condition and looks up the selectivity function for=
, which iseqsel
. For equality estimation the histogram is not useful; instead the list ofmost common values (MCVs) is used to determine the selectivity. Let's have a look at the MCVs, with some additional columns that will be useful later:SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_statsWHERE tablename='tenk1' AND attname='stringu1';null_frac | 0n_distinct | 676most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
CRAAAA
appears in the list of MCVs, the selectivity is merely the corresponding entry in the list of most common frequencies (MCFs):selectivity = mcf[3] = 0.003
tenk1
:rows = 10000 * 0.003 = 30
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx'; QUERY PLAN---------------------------------------------------------- Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244) Filter: (stringu1 = 'xxx'::name)
selectivity = (1 - sum(mcv_freqs))/(num_distinct - num_mcv) = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003))/(676 - 10) = 0.0014559