
Commit c10bec2

Lev Kokotov authored and gitbook-bot committed

GITBOOK-65: change request with no subject merged in GitBook

1 parent: 5b69235

File tree: 5 files changed, +141 −30 lines

pgml-docs/docs/guides/SUMMARY.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -53,6 +53,7 @@
 * [Tabular data](data-storage-and-retrieval/tabular-data.md)
 * [Vectors](data-storage-and-retrieval/vectors.md)
 * [Partitioning](data-storage-and-retrieval/partitioning.md)
+* [Documents](data-storage-and-retrieval/documents.md)
 * [Deploying PostgresML](deploying-postgresml/README.md)
 * [PostgresML Cloud](deploying-postgresml/postgresml-cloud/README.md)
 * [Plans](deploying-postgresml/postgresml-cloud/plans/README.md)
```
Lines changed: 3 additions & 0 deletions

# Documents

WIP: [https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset)
Lines changed: 107 additions & 0 deletions

# Partitioning

Partitioning is the act of splitting Postgres tables into multiple smaller tables, which allows each smaller table to be queried independently. This is useful, and sometimes necessary, when tables get so large that querying a single table becomes too slow. Partitioning requires detailed knowledge of the dataset, and uses that knowledge to help Postgres execute faster queries.

### Partitioning schemes

Postgres supports three (3) kinds of partitioning schemes: by range, by list, and by hash. Each scheme is appropriate for different use cases, and choosing the right scheme is important to get the best performance out of your data.
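Partitioning by list is not demonstrated later in this guide, so here is a minimal sketch of how it looks. The table and key values below are hypothetical, not part of the guide's dataset: list partitions route rows to child tables by exact values of the partition key.

```sql
-- Hypothetical example: route rows by exact values of the "region" column.
CREATE TABLE sensor_readings (
    "region" TEXT,
    "reading" REAL
) PARTITION BY LIST ("region");

-- Each child table enumerates the exact key values it stores.
CREATE TABLE sensor_readings_us
    PARTITION OF sensor_readings
    FOR VALUES IN ('us-east', 'us-west');

CREATE TABLE sensor_readings_eu
    PARTITION OF sensor_readings
    FOR VALUES IN ('eu-central');
```

A row whose "region" is not listed in any child table will be rejected unless a `DEFAULT` partition is also created.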
### Partition by range

Partition by range operates on numerical values. Dates, numbers and vectors can be used as range partition keys because their range of values can be split into non-overlapping parts.

For example, if we have a table with a date column (`TIMESTAMPTZ`, a date and time with timezone information), we can create three (3) partitions with the following bounds:

* partition 1 will contain all dates prior to January 1, 2000,
* partition 2 will contain all dates between January 1, 2000 and December 31, 2020,
* partition 3 will contain all dates from January 1, 2021 onward.

While these ranges are not even, we chose them because of some knowledge we have about our dataset. In our hypothetical example, we know that these date ranges will split our dataset into roughly three (3) evenly sized tables.
#### Building partitions

Let's build some real partitions with a dataset from Kaggle: [Hourly Energy Consumption](https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption).

You can create a partition by range in Postgres with just a few queries. Partitioning requires two types of tables: the parent table which defines the partitioning scheme, and the child tables which define the ranges and store the actual data.

Let's start with the parent table:

```sql
CREATE TABLE energy_consumption (
    "Datetime" TIMESTAMPTZ,
    "AEP_MW" REAL
) PARTITION BY RANGE("Datetime");
```
Now, let's add a couple child tables:

```sql
CREATE TABLE energy_consumption_2004_2011
PARTITION OF energy_consumption
FOR VALUES FROM ('2004-01-01') TO ('2011-12-31');

CREATE TABLE energy_consumption_2012_2018
PARTITION OF energy_consumption
FOR VALUES FROM ('2011-12-31') TO ('2018-12-31');
```
Postgres partition bounds are defined as `[start, end)`, which means the start of the range is included and the end of the range is excluded.
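As a quick sanity check of the `[start, end)` semantics, a row that falls exactly on the shared bound lands in the second partition, because the first partition's end is exclusive. This is a sketch assuming the two child tables above exist; the inserted values are illustrative only:

```sql
-- '2011-12-31' is excluded from energy_consumption_2004_2011 (end is exclusive)
-- and included in energy_consumption_2012_2018 (start is inclusive).
INSERT INTO energy_consumption ("Datetime", "AEP_MW")
VALUES ('2011-12-31 00:00:00+00', 12345.0);

-- tableoid::regclass reveals which child table physically stores the row.
SELECT tableoid::regclass, "Datetime"
FROM energy_consumption
WHERE "AEP_MW" = 12345.0;
```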
Let's ingest the dataset into our partitioned table and see what we get:

```
postgresml=# \copy energy_consumption FROM 'AEP_hourly.csv' CSV HEADER;
COPY 121273
```
We have a grand total of 121,273 rows. If we partitioned the dataset correctly, the two child tables should have roughly the same number of rows:

```
postgresml=# SELECT count(*) FROM energy_consumption_2004_2011;
 count
-------
 63511

postgresml=# SELECT count(*) FROM energy_consumption_2012_2018;
 count
-------
 57762
```
Nicely done. The two tables are pretty close to each other in size, which gives us a roughly even distribution of data across our partitioning scheme.

Postgres allows us to query each partition individually, which is nice if we know the range specification. While this works in this example, in a living dataset we would continue to add partitions to include more values. If we wanted to store dates for the years 2019 through 2023, for example, we would need to make at least one more child table.

To make this user friendly, Postgres allows us to query the parent table instead. As long as we specify the partition key, we are guaranteed to get the most efficient query plan possible:
```sql
SELECT
    avg("AEP_MW")
FROM energy_consumption
WHERE "Datetime" BETWEEN '2004-01-01' AND '2005-01-01';
```

```
        avg
--------------------
 15175.689170820118
```
If we look at the query plan, we'll see that Postgres only queries the first child table we created:

```
postgresml=# EXPLAIN SELECT
    avg("AEP_MW")
FROM energy_consumption
WHERE "Datetime" BETWEEN '2004-01-01' AND '2005-01-01';

                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=10000001302.18..10000001302.19 rows=1 width=8)
   ->  Seq Scan on energy_consumption_2004_2011 energy_consumption (...)
         Filter: [...]
```

which reduces the number of rows it has to scan by half. By adding more, smaller partitions, we can significantly reduce the amount of data Postgres needs to scan to execute a query. That being said, scanning multiple tables can be more expensive than scanning just one, so adding too many partitions quickly erodes the benefits if queries need to scan more than a few child tables.
### Partition by hash

Partitioning by hash, unlike partitioning by range, can be applied to any data type, including text. A hash function is applied to the partition key to produce a reasonably unique number, and that number modulo the number of partitions determines the right child table for the row.
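The scheme above can be sketched with a short example. The table and column names here are illustrative, not part of the guide's energy dataset:

```sql
-- Hash-partitioned table keyed on a text column.
CREATE TABLE user_events (
    "user_id" TEXT,
    "event" TEXT
) PARTITION BY HASH ("user_id");

-- Each child table receives the rows whose hashed key,
-- modulo 4, equals its declared remainder.
CREATE TABLE user_events_0 PARTITION OF user_events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE user_events_1 PARTITION OF user_events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE user_events_2 PARTITION OF user_events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE user_events_3 PARTITION OF user_events FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```

Because the hash function spreads keys approximately uniformly, hash partitioning is a good default when there is no natural range or list to split on.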

pgml-docs/docs/guides/data-storage-and-retrieval/tabular-data.md

Lines changed: 7 additions & 9 deletions

````diff
@@ -176,11 +176,11 @@ Postgres supports many more indexing algorithms, e.g. GiST, BRIN, GIN, and Hash.
 
 ### Accelerating recall
 
-Once the dataset gets large enough, and we're talking millions of rows, it's no longer practical to query the table directly. The amount of data Postgres has to scan becomes large and queries become slow. To help with that, tables should have indexes that order and organize commonly read columns. Searching a B-Tree index can be done in _O(log n)_ time, which is orders of magnitude faster than the _O(n)_ full table search.
+Once the dataset gets large enough, and we're talking millions of rows, it's no longer practical to query the table directly. The amount of data Postgres has to scan becomes large and queries become slow. At that point, tables should have indexes that order and organize commonly read columns. Searching an index can be done in _O(log n)_ time, which is orders of magnitude faster than the _O(n)_ full table scan.
 
 #### Querying an index
 
-Postgres automatically uses indexes when it's possible and optimal to do so. From our example, we can filter the dataset by the "Address" column, and we can do so very quickly because of the index we created:
+Postgres automatically uses indexes when it's possible and optimal to do so. From our example, if we filter the dataset by the "Address" column, Postgres will use the index we created and return a result quickly:
 
 ```sql
 SELECT
@@ -199,9 +199,9 @@ which produces
 (1 row)
 ```
 
-Since we have a unique index on the table, we will only see one row with that address.
+Since we have a unique index on the table, we expect to see only one row with that address.
 
-To double check that Postgres is using an index, we can check the query execution plan. A query plan is a list of steps that Postgres will take to get the query result we requested. To see the query plan, prepend the keyword `EXPLAIN` to the query you're running:
+To double check that Postgres is using an index, we can take a look at the query execution plan. A query plan is a list of steps that Postgres will take to get the result of the query. To see the query plan, prepend the keyword `EXPLAIN` to the query you'd like to run:
 
 ```
 postgresml=# EXPLAIN (FORMAT JSON) SELECT
@@ -232,12 +232,10 @@ WHERE "Address" = '1 Infinite Loop, Cupertino, California';
 ]
 ```
 
-The plan indicates it will use an "Index Scan" on the index `usa_house_prices_Address_index`, which is what we're expecting.
+The plan indicates that it will use an "Index Scan" on `usa_house_prices_Address_index`, which is what we're expecting. Using `EXPLAIN` doesn't actually run the query, so it's safe to use on production systems.
 
-Using `EXPLAIN` doesn't actually run the query, so it's safe to use on production systems.
-
-The ability to create indexes on datasets of any size, and to then efficiently query that data with those indexes, is what separates Postgres from most ad-hoc tools like Pandas and Arrow. Postgres can store and query data that would never fit in memory, and it can do that quicker and more efficiently than most other database systems used in the industry.
+The ability to create indexes on datasets of any size, and to efficiently query that data using indexes, is what separates Postgres from most ad-hoc tools like Pandas and Arrow. Postgres can store and query data that would never fit in memory, and it can do that quicker and more efficiently than most other database systems used in the industry.
 
 #### Maintaining an index
 
-Postgres indexes require no special maintenance. They are automatically updated when data is added or removed. Postgres also ensures that indexes are efficiently organized and are ACID compliant. The database guarantees that the data is always consistent, no matter how many concurrent changes are made.
+Postgres indexes require no special maintenance. They are automatically updated when data is added and removed. Postgres also ensures that indexes are efficiently organized and are ACID compliant: the database guarantees that the data is always consistent, no matter how many concurrent changes are made.
````

