# Partitioning

Partitioning is the act of splitting Postgres tables into multiple smaller tables, which allows each smaller table to be queried independently. This is useful, and sometimes necessary, when tables get so large that querying a single table becomes too slow. Partitioning requires detailed knowledge of the dataset, and it uses that knowledge to help Postgres execute faster queries.

### Partitioning schemes

Postgres supports three kinds of partitioning schemes: by range, by list, and by hash. Each scheme is appropriate for different use cases, and choosing the right scheme is important for getting the best performance out of your data.
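
For instance, partition by list assigns rows to child tables by matching the partition key against explicit sets of values. A minimal sketch, assuming a hypothetical `orders` table with a `region` column:

```sql
-- Hypothetical table, partitioned by explicit lists of region values
CREATE TABLE orders (
    region TEXT
) PARTITION BY LIST (region);

CREATE TABLE orders_us PARTITION OF orders
FOR VALUES IN ('us-east', 'us-west');

CREATE TABLE orders_eu PARTITION OF orders
FOR VALUES IN ('eu-central');
```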

### Partition by range

Partition by range operates on values that can be ordered. Dates, numbers, and even vectors can be used as range partition keys, because their range of values can be split into non-overlapping parts.

For example, if we have a table with a date column (`TIMESTAMPTZ`, a date and time with timezone information), we can create three partitions with the following bounds:

* partition 1 will contain all dates prior to January 1, 2000,
* partition 2 will contain all dates between January 1, 2000 and December 31, 2020,
* partition 3 will contain all dates from January 1, 2021 onward.

While these ranges are not even, we chose them because of some knowledge we have about our dataset. In our hypothetical example, we know that these date ranges will split our dataset into roughly three evenly sized tables.
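
In Postgres DDL, those three bounds could be sketched like this. The table and column names are illustrative, and `MINVALUE` / `MAXVALUE` mark the unbounded ends of a range:

```sql
CREATE TABLE events (
    created_at TIMESTAMPTZ
) PARTITION BY RANGE (created_at);

-- All dates prior to January 1, 2000
CREATE TABLE events_before_2000 PARTITION OF events
FOR VALUES FROM (MINVALUE) TO ('2000-01-01');

-- January 1, 2000 through December 31, 2020
CREATE TABLE events_2000_2020 PARTITION OF events
FOR VALUES FROM ('2000-01-01') TO ('2021-01-01');

-- January 1, 2021 onward
CREATE TABLE events_2021_onward PARTITION OF events
FOR VALUES FROM ('2021-01-01') TO (MAXVALUE);
```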

#### Building partitions

Let's build some real partitions with a dataset from Kaggle: [Hourly Energy Consumption](https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption).

You can create a partition by range in Postgres with just a few queries. Partitioning requires two types of tables: the parent table, which defines the partitioning scheme, and the child tables, which define the ranges and store the actual data.

Let's start with the parent table:

```sql
CREATE TABLE energy_consumption (
    "Datetime" TIMESTAMPTZ,
    "AEP_MW" REAL
) PARTITION BY RANGE ("Datetime");
```

Now, let's add a couple of child tables:

```sql
CREATE TABLE energy_consumption_2004_2011
PARTITION OF energy_consumption
FOR VALUES FROM ('2004-01-01') TO ('2011-12-31');

CREATE TABLE energy_consumption_2012_2018
PARTITION OF energy_consumption
FOR VALUES FROM ('2011-12-31') TO ('2018-12-31');
```

Postgres partition bounds are defined as `[start, end)`, which means the start of the range is included and the end of the range is excluded.

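A quick way to check which child table a row lands in is the hidden `tableoid` column. For example, with the tables above, a row stamped exactly `2011-12-31` should fall into the second partition, since the first partition's upper bound is exclusive (the inserted reading is an illustrative value, not part of the dataset):

```sql
INSERT INTO energy_consumption VALUES ('2011-12-31 00:00:00', 15000);

-- tableoid::regclass reports the child table that stores each row
SELECT tableoid::regclass, "Datetime"
FROM energy_consumption
WHERE "Datetime" = '2011-12-31 00:00:00';
```
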
Let's ingest the dataset into our partitioned table and see what we get:

```
postgresml=# \copy energy_consumption FROM 'AEP_hourly.csv' CSV HEADER;
COPY 121273
```

We have a grand total of 121,273 rows. If we partitioned the dataset correctly, the two child tables should have roughly the same number of rows:

```
postgresml=# SELECT count(*) FROM energy_consumption_2004_2011;
 count
-------
 63511

postgresml=# SELECT count(*) FROM energy_consumption_2012_2018;
 count
-------
 57762
```

Nicely done. The two tables are pretty close in size, which gives us a roughly even distribution of data across our partitioning scheme.

Postgres allows us to query each partition individually, which is convenient if we know how the ranges are specified. While this works in our example, a living dataset would keep growing, and we would continue adding partitions to cover new values. If we wanted to store dates for the years 2019 through 2023, for example, we would need to create at least one more child table.
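
Extending the scheme is the same `CREATE TABLE ... PARTITION OF` statement as before. A sketch of a child table covering 2019 through 2023; the name and exact bounds are our choice, picking up where the previous partition left off:

```sql
CREATE TABLE energy_consumption_2019_2023
PARTITION OF energy_consumption
FOR VALUES FROM ('2018-12-31') TO ('2024-01-01');
```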

To make this user-friendly, Postgres allows us to query the parent table instead. As long as we filter on the partition key, Postgres can prune the partitions it doesn't need and produce an efficient query plan:
| 76 | + |
| 77 | +```sql |
| 78 | +SELECT |
| 79 | +avg("AEP_MW") |
| 80 | +FROM energy_consumption |
| 81 | +WHERE"Datetime" BETWEEN'2004-01-01'AND'2005-01-01'; |
| 82 | +``` |

```
        avg
--------------------
 15175.689170820118
```

If we look at the query plan, we'll see that Postgres only queries the first child table we created:

```
postgresml=# EXPLAIN SELECT
    avg("AEP_MW")
FROM energy_consumption
WHERE "Datetime" BETWEEN '2004-01-01' AND '2005-01-01';

                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=10000001302.18..10000001302.19 rows=1 width=8)
   ->  Seq Scan on energy_consumption_2004_2011 energy_consumption (...)
         Filter: [...]
```

This reduces the number of rows Postgres has to scan by half. By adding more, smaller partitions, we can significantly reduce the amount of data Postgres needs to scan to execute a query. That being said, scanning multiple tables can be more expensive than scanning just one, so adding too many partitions quickly erodes the benefit if queries need to scan more than a few child tables.

### Partition by hash

Partitioning by hash, unlike partitioning by range, can be applied to any data type, including text. A hash function is applied to the partition key to produce a reasonably unique number, and that number is then divided by the number of partitions; the remainder determines which child table stores the row.
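
In DDL, hash partitions are declared with a modulus and a remainder. A sketch, assuming a hypothetical `users` table keyed by `email`:

```sql
CREATE TABLE users (
    email TEXT
) PARTITION BY HASH (email);

-- Each row goes to the child table where hash(email) % 3 equals the remainder
CREATE TABLE users_0 PARTITION OF users FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE users_1 PARTITION OF users FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE users_2 PARTITION OF users FOR VALUES WITH (MODULUS 3, REMAINDER 2);
```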