Movatterモバイル変換


[0]ホーム

URL:


leafo.net

My projects

View more →

Recent guides

View all →

Recent posts

Calculating Percentile (and Median) in PostgreSQL

PostedAugust 09, 2018 (UpdatedJune 05, 2019) by leafo (@moonscript) · Tags: postgresql
Tweet

Modern versions of PostgreSQL (≥ 9.4) make it easy to calculate the percentilefor a list of values at any percentage using thepercentile_cont andpercentile_discordered-set aggregation functions.

The two functions work very similarly, but differ in how they merge the final result:

  • percentile_disc will return a value from the input set closest to the percentile you request
  • percentile_cont will return an interpolated value between multiple values based on the distribution. You can think of this as being more accurate, but can return a fractional value between the two values from the input

In all these examples we'll imagine we have a table calledthings with a column calledvalue.

Calculating The Median

The median is generally the 50th percentile. You can calculate it with the following query:

selectpercentile_disc(0.5)withingroup(orderbythings.value)fromthings

Thewithin group syntax is called anordered-set aggregate

For a percentile that lies between values from the input, usepercentile_contto get the interpolated result:

selectpercentile_const(0.5)withingroup(orderbythings.value)fromthings

What does interpolated mean? The easiest way to illustrate the differenceis to consider a list of 2 numbers: [1, 2].percentile_disc(0.5) (thediscete median) of this list will be1, the cloest value to the center thatis in the list. On the other hand,percentile_const(0.5) will return1.5,the average of the two numbers since there’s no individual item in the listthat represents the median.

Calculating A Perentile

Any percentile can be calculated by providing a fractional amount from 0 to 1.The percentile function can be used multiple times within the query. In thisexample, 3 quartiles are calculated.

selectpercentile_disc(0.25)withingroup(orderbythings.value),percentile_disc(0.5)withingroup(orderbythings.value),percentile_disc(0.75)withingroup(orderbythings.value)fromthings

within group can be used with other claues likegroup by. Here’s an examplewhere different percentiles are calculated for numbers less than 100, andeverything else.

selectthings.value<100asless_than_100,percentile_disc(0.75)withingroup(orderbythings.value)fromthingsgroupby1

group by 1 groups by the first expression in the select statement

Calculating All Percentiles

It might be useful to calculate what the percentile value is for each percent, 1to 100. This can be used to get an idea of the shape of your data, or you cancache it into a materialized view for fast percentile estimation for newvalues.

One approach is to combine thepercentile_ functions discussed above withgenerate_series. In this example, every percentile from 1 to 100 is returnedas individual rows:

selectk,percentile_disc(k)withingroup(orderbythings.value)fromthings,generate_series(0.01,1,0.01)askgroupbyk

generate_series generates temporary table with values between a starting andending value, with an optional step. In this example,generate_series(0.01, 1, 0.01) returns0.01,0.02, etc..

The arguments ofgenerate_series can be modified to specify whatpercentiles we want to calculate. For example, quartiles can be calculatedwithgenerate_series(0.25, 1, 0.25).

This method can be slow because it performs a percentile lookup over the entiredataset for each of the percentiles it calculates. For example, if you arecalculating 100 percentiles then it will scan your data 100 times.

Ideally we want to scan the data only once while keeping track of whatpercentile a particular value is in. PostgreSQL has awindowfunction thatcan be used to accomplish this:ntile.

ntile will assign each value in your dataset into bucket, and you specify howmany buckets there are.

The completed example requires a subquery, so we'll look at the inner queryfirst. In this example we'll calculate the 100 percentiles, so we'll usentile(100):

selectthings.value,ntile(100)over(orderbythings.value)fromthings

This query returns all of our values, along with what bucket each one belongsto. The next step is to interpret these buckets into percentile values. Becauseeach value has been evenly distributed over 50 buckets then we can look at themax value of the 50th bucket to see where the 50th percentile ends.Generalizing this, we can use an aggregate query to calculate themax(value)for each distinct bucket number to find all the percentiles!

Here’s the final query calculating all the percentiles from 1 to 100.

selectmax(buckets.value),ntileaspercentilefrom(selectthings.value,ntile(100)over(orderbythings.value)fromthings)asbucketsgroupby2orderby2

This should generally run about 100 times faster than the example usingpercentile_disc with 100 percentiles, since it will only scan the data onceinstead of 100 times.

The percentile is returned as an integer, instead of a fraction.You may need to adjust your query appropriately before consuming the resultselsewhere.

Keep in mind that if you want to calculate a number of percentiles other than100 then some math will be needed to convert thentile bucket to thepercentile. For example, when calculating quartiles you would usentile(4).Buckets 1,2,3,4 are assigned. To convert bucket number to percentile:ntile / 4.0.

Here are some more guides tagged 'postgresql'
PostedAugust 07, 2017
PostedJuly 04, 2015

leafo.net · Generated Sun Oct 8 13:02:35 2023 bySitegenmastodon.social/@leafo


[8]ページ先頭

©2009-2025 Movatter.jp