# Partitioning

Partitioning is the act of splitting Postgres tables into multiple smaller tables, which allows each smaller table to be queried independently. This is useful, and sometimes necessary, when tables get so large that querying a single table becomes too slow. Partitioning requires detailed knowledge of the dataset, and it uses that knowledge to help Postgres execute faster queries.

### Partitioning schemes

Postgres supports three kinds of partitioning schemes: by range, by list, and by hash. Each scheme is appropriate for different use cases, and choosing the right scheme is important for getting the best performance out of your data.
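
For instance, partition by list assigns rows to child tables by matching the partition key against explicit sets of values. A minimal sketch, assuming a hypothetical `orders` table with a `region` column:

```sql
-- Hypothetical table, partitioned by explicit lists of region values
CREATE TABLE orders (
    region TEXT
) PARTITION BY LIST (region);

CREATE TABLE orders_us PARTITION OF orders
FOR VALUES IN ('us-east', 'us-west');

CREATE TABLE orders_eu PARTITION OF orders
FOR VALUES IN ('eu-central');
```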

### Partition by range

Partition by range operates on values that can be ordered. Dates, numbers, and even vectors can be used as range partition keys, because their range of values can be split into non-overlapping parts.

For example, if we have a table with a date column (`TIMESTAMPTZ`, a date and time with timezone information), we can create three partitions with the following bounds:

* partition 1 will contain all dates prior to January 1, 2000,
* partition 2 will contain all dates between January 1, 2000 and December 31, 2020,
* partition 3 will contain all dates from January 1, 2021 onward.

While these ranges are not even, we chose them because of some knowledge we have about our dataset. In our hypothetical example, we know that these date ranges will split our dataset into roughly three evenly sized tables.
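
In Postgres DDL, those three bounds could be sketched like this. The table and column names are illustrative, and `MINVALUE` / `MAXVALUE` mark the unbounded ends of a range:

```sql
CREATE TABLE events (
    created_at TIMESTAMPTZ
) PARTITION BY RANGE (created_at);

-- All dates prior to January 1, 2000
CREATE TABLE events_before_2000 PARTITION OF events
FOR VALUES FROM (MINVALUE) TO ('2000-01-01');

-- January 1, 2000 through December 31, 2020
CREATE TABLE events_2000_2020 PARTITION OF events
FOR VALUES FROM ('2000-01-01') TO ('2021-01-01');

-- January 1, 2021 onward
CREATE TABLE events_2021_onward PARTITION OF events
FOR VALUES FROM ('2021-01-01') TO (MAXVALUE);
```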

#### Building partitions

Let's build some real partitions with a dataset from Kaggle: [Hourly Energy Consumption](https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption).

You can create a partition by range in Postgres with just a few queries. Partitioning requires two types of tables: the parent table, which defines the partitioning scheme, and the child tables, which define the ranges and store the actual data.

Let's start with the parent table:

```sql
CREATE TABLE energy_consumption (
    "Datetime" TIMESTAMPTZ,
    "AEP_MW" REAL
) PARTITION BY RANGE ("Datetime");
```

Now, let's add a couple of child tables:

```sql
CREATE TABLE energy_consumption_2004_2011
PARTITION OF energy_consumption
FOR VALUES FROM ('2004-01-01') TO ('2011-12-31');

CREATE TABLE energy_consumption_2012_2018
PARTITION OF energy_consumption
FOR VALUES FROM ('2011-12-31') TO ('2018-12-31');
```

Postgres partition bounds are defined as `[start, end)`, which means the start of the range is included and the end of the range is excluded.

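A quick way to check which child table a row lands in is the hidden `tableoid` column. For example, with the tables above, a row stamped exactly `2011-12-31` should fall into the second partition, since the first partition's upper bound is exclusive (the inserted reading is an illustrative value, not part of the dataset):

```sql
INSERT INTO energy_consumption VALUES ('2011-12-31 00:00:00', 15000);

-- tableoid::regclass reports the child table that stores each row
SELECT tableoid::regclass, "Datetime"
FROM energy_consumption
WHERE "Datetime" = '2011-12-31 00:00:00';
```
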
Let's ingest the dataset into our partitioned table and see what we get:

```
postgresml=# \copy energy_consumption FROM 'AEP_hourly.csv' CSV HEADER;
COPY 121273
```

We have a grand total of 121,273 rows. If we partitioned the dataset correctly, the two child tables should have roughly the same number of rows:

```
postgresml=# SELECT count(*) FROM energy_consumption_2004_2011;
 count
-------
 63511

postgresml=# SELECT count(*) FROM energy_consumption_2012_2018;
 count
-------
 57762
```

Nicely done. The two tables are pretty close in size, which gives us a roughly even distribution of data across our partitioning scheme.

Postgres allows us to query each partition individually, which is convenient if we know how the ranges are specified. While this works in our example, a living dataset would keep growing, and we would continue adding partitions to cover new values. If we wanted to store dates for the years 2019 through 2023, for example, we would need to create at least one more child table.
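
Extending the scheme is the same `CREATE TABLE ... PARTITION OF` statement as before. A sketch of a child table covering 2019 through 2023; the name and exact bounds are our choice, picking up where the previous partition left off:

```sql
CREATE TABLE energy_consumption_2019_2023
PARTITION OF energy_consumption
FOR VALUES FROM ('2018-12-31') TO ('2024-01-01');
```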

To make this user-friendly, Postgres allows us to query the parent table instead. As long as we filter on the partition key, Postgres can prune the partitions it doesn't need and produce an efficient query plan:
| 76 | + |
| 77 | +```sql |
| 78 | +SELECT |
| 79 | +avg("AEP_MW") |
| 80 | +FROM energy_consumption |
| 81 | +WHERE"Datetime" BETWEEN'2004-01-01'AND'2005-01-01'; |
| 82 | +``` |

```
        avg
--------------------
 15175.689170820118
```

If we look at the query plan, we'll see that Postgres only queries the first child table we created:

```
postgresml=# EXPLAIN SELECT
    avg("AEP_MW")
FROM energy_consumption
WHERE "Datetime" BETWEEN '2004-01-01' AND '2005-01-01';

                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=10000001302.18..10000001302.19 rows=1 width=8)
   ->  Seq Scan on energy_consumption_2004_2011 energy_consumption (...)
         Filter: [...]
```

This reduces the number of rows Postgres has to scan by half. By adding more, smaller partitions, we can significantly reduce the amount of data Postgres needs to scan to execute a query. That being said, scanning multiple tables can be more expensive than scanning just one, so adding too many partitions quickly erodes the benefit if queries need to scan more than a few child tables.

### Partition by hash

Partitioning by hash, unlike partitioning by range, can be applied to any data type, including text. A hash function is applied to the partition key to produce a reasonably unique number, and that number is then divided by the number of partitions; the remainder determines which child table stores the row.
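
In DDL, hash partitions are declared with a modulus and a remainder. A sketch, assuming a hypothetical `users` table keyed by `email`:

```sql
CREATE TABLE users (
    email TEXT
) PARTITION BY HASH (email);

-- Each row goes to the child table where hash(email) % 3 equals the remainder
CREATE TABLE users_0 PARTITION OF users FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE users_1 PARTITION OF users FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE users_2 PARTITION OF users FOR VALUES WITH (MODULUS 3, REMAINDER 2);
```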