Commit a38c1a9

Lev Kokotov authored and gitbook-bot committed

GITBOOK-66: Edits to tabular data, vectors and partitioning vectors

1 parent c10bec2 · commit a38c1a9

File tree: 3 files changed, +224 −34 lines changed

‎pgml-docs/docs/guides/data-storage-and-retrieval/partitioning.md

Lines changed: 201 additions & 11 deletions
@@ -1,19 +1,19 @@
# Partitioning

-Partitioning is the act of splitting Postgres tables into multiple smaller tables, which allows to query each smaller table independently. This is useful and sometimes necessary when tables get so large that querying a single table becomes too slow. Partitioning requires detailed knowledge of the dataset and uses that knowledge to help Postgres execute faster queries.
+Partitioning is splitting Postgres tables into multiple smaller tables, with the intention of querying each smaller table independently. This is useful and sometimes necessary when tables get so large that accessing them becomes too slow. Partitioning requires detailed knowledge of the dataset and uses that knowledge to help Postgres execute queries faster.

### Partitioning schemes

-Postgres supports three (3) kinds of partitioning schemes: by range, by list, and by hash. Each scheme is appropriate for different use cases, and choosing the right scheme is important to get the best performance out of your data.
+Postgres supports three (3) kinds of partition schemes: by range, by hash, and by list. Each scheme is appropriate for different use cases, and choosing the right scheme is important to get the best performance out of your data.

### Partition by range

Partition by range operates on numerical values. Dates, numbers and vectors can be used as range partition keys because their range of values can be split into non-overlapping parts.

-For example, if we have a table with a date column (`TIMESTAMPTZ`, a date and time with timezone information), we can create three (3) partitions with the following bounds:
+For example, if we have a table with a date column (`TIMESTAMPTZ`, a date and time with timezone information), we could create three (3) partitions with the following bounds:

* partition 1 will contain all dates prior to January 1, 2000,
-* partition 2 will contain all dates between January 1, 2000 and December 31, 2020,
+* partition 2 will contain all dates between January 1, 2000 and January 1, 2021,
* partition 3 will contain all dates after January 1, 2021.

While these ranges are not even, we chose them because of some knowledge we have about our dataset. In our hypothetical example, we know that these date ranges will split our dataset into roughly three (3) evenly sized tables.
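
For reference, here is a minimal sketch of what those three bounds could look like in Postgres syntax. The table name `measurements` and column `created_at` are hypothetical, chosen only for illustration; the actual table for the energy dataset is created further down:

```sql
-- A sketch of the three date-range partitions described above.
-- "measurements" and "created_at" are illustrative names, not part of the dataset.
CREATE TABLE measurements (
    created_at TIMESTAMPTZ NOT NULL,
    value REAL NOT NULL
) PARTITION BY RANGE (created_at);

-- Each child table owns a non-overlapping range: inclusive FROM, exclusive TO.
CREATE TABLE measurements_before_2000
    PARTITION OF measurements
    FOR VALUES FROM (MINVALUE) TO ('2000-01-01');

CREATE TABLE measurements_2000_2020
    PARTITION OF measurements
    FOR VALUES FROM ('2000-01-01') TO ('2021-01-01');

CREATE TABLE measurements_2021_onward
    PARTITION OF measurements
    FOR VALUES FROM ('2021-01-01') TO (MAXVALUE);
```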
@@ -22,7 +22,7 @@ While these ranges are not even, we chose them because of some knowledge we have

Let's build some real partitions with a dataset from Kaggle: [Hourly Energy Consumption](https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption).

-You can create a partition by range in Postgres with just a few queries. Partitioning requires two types of tables: the parent table which defines the partitioning scheme, and the child tables which define the ranges and store the actual data.
+In Postgres, you can create a partition by range with just a few queries. Partitioning requires creating two types of tables: the parent table which defines the partitioning scheme, and the child tables which define the ranges and store the actual data.

Let's start with the parent table:

@@ -68,11 +68,11 @@ postgresml=# SELECT count(*) FROM energy_consumption_2004_2011;
57762
```

-Nicely done. The two tables are pretty close to each other which creates a roughly even distribution of data in our partitioning scheme.
+Nicely done. The two table counts are pretty close, which creates a roughly even distribution of data in our partitioning scheme.

-Postgres allows to query each partition individually, which is nice if we know what the range specification is. While this works in this example, in a living dataset, we could continue to add partitions to include more values. If we wanted to store dates for the years 2019 through 2023, for example, we would need to make at least one more child table.
+Postgres allows us to query each partition individually, which is nice if we know what the range specification is. While this works in this example, in a living dataset, we would need to continue adding partitions to include more values. If we wanted to store dates for the years 2019 through 2023, for example, we would need to make at least one more child table.

-To make this user friendly, Postgres allows us to query the parent table instead. As long as we specify the partition key, we are guaranteed to get the most efficient query plan possible:
+To make reading this data user-friendly, Postgres allows us to query the parent table instead. As long as we specify the partition key, we are guaranteed to get the most efficient query plan possible:

```sql
SELECT
@@ -87,7 +87,7 @@ WHERE "Datetime" BETWEEN '2004-01-01' AND '2005-01-01';
15175.689170820118
```

-If we look at the query plan, we'll see that Postgres only queries the first child table we created:
+If we look at the query plan, we'll see that Postgres only queries one of the child tables we created:

```
postgresml=# EXPLAIN SELECT
@@ -102,8 +102,198 @@ WHERE "Datetime" BETWEEN '2004-01-01' AND '2005-01-01';
   Filter: [...]
```

-which reduces the number of rows it has to scan by half. By adding more smaller partitions, we can significantly reduce the amount of data Postgres needs to scan to execute a query. That being said, scanning multiple tables could be more expensive than scanning just one table, so adding too many partitions quickly reduces its benefits if the queries need to scan more than a few child tables.
+This reduces the number of rows Postgres has to scan by half. By adding more partitions, we can significantly reduce the amount of data Postgres needs to scan to perform a query.

### Partition by hash

-Partitioning by hash, unlike by range, can be applied to any data type, including text. A hash function is applied to the partition key to create a reasonably unique number, and that number is then divided by the number of partitions to find the right child table for the row.
+Partitioning by hash, unlike by range, can be applied to any data type, including text. A hash function is executed on the partition key to create a reasonably unique number, and that number is then divided by the number of partitions to find the right child table for the row.
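
To see the routing logic, you can approximate it in plain SQL. The sketch below uses the built-in `hashtext()` function purely for illustration; the hash function Postgres actually uses to route rows to partitions is a different one, but the modulo arithmetic is the same:

```sql
-- Illustration only: hash a key and take the remainder to pick one of
-- two partitions. Postgres's real partition hash differs from hashtext().
SELECT
    abs(hashtext('1 Infinite Loop, Cupertino, California')) % 2
        AS partition_number;
```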
+
+To create a table partitioned by hash, the syntax is similar to partition by range. Let's use the USA House Prices dataset we used in [Vectors](vectors.md) and [Tabular data](tabular-data.md), and split that table into two (2) roughly equal parts. Since we already have the `usa_house_prices` table, let's create a new one with the same columns, except this one will be partitioned:
+
+```sql
+CREATE TABLE usa_house_prices_partitioned (
+    "Avg. Area Income" REAL NOT NULL,
+    "Avg. Area House Age" REAL NOT NULL,
+    "Avg. Area Number of Rooms" REAL NOT NULL,
+    "Avg. Area Number of Bedrooms" REAL NOT NULL,
+    "Area Population" REAL NOT NULL,
+    "Price" REAL NOT NULL,
+    "Address" VARCHAR NOT NULL
+) PARTITION BY HASH("Address");
+```
+
+Let's add two (2) partitions by hash. Hashing uses modulo arithmetic; when creating a child table with this scheme, you need to specify the denominator and the remainder:
+
+```sql
+CREATE TABLE usa_house_prices_partitioned_1
+PARTITION OF usa_house_prices_partitioned
+FOR VALUES WITH (modulus 2, remainder 0);
+
+CREATE TABLE usa_house_prices_partitioned_2
+PARTITION OF usa_house_prices_partitioned
+FOR VALUES WITH (modulus 2, remainder 1);
+```
+
+Importing data into the new table can be done with just one query:
+
+```sql
+INSERT INTO usa_house_prices_partitioned
+SELECT * FROM usa_house_prices;
+```
+
+```
+INSERT 0 5000
+```
+
+Let's validate that our partitioning scheme worked:
+
+```
+postgresml=# SELECT count(*) FROM usa_house_prices_partitioned_1;
+ count
+-------
+  2528
+(1 row)
+
+postgresml=# SELECT count(*) FROM usa_house_prices_partitioned_2;
+ count
+-------
+  2472
+(1 row)
+```
+
+Great! As expected, hashing split our dataset into roughly equal parts. To take advantage of this when reading data, you need to specify the partition key "Address" in every query. Postgres will hash the key using the same hashing function and query the child table that can contain the row with that "Address" value:
+
+```
+postgresml=# EXPLAIN SELECT
+    "Avg. Area House Age",
+    "Address"
+FROM usa_house_prices_partitioned
+WHERE "Address" = '1 Infinite Loop, Cupertino, California';
+                                                 QUERY PLAN
+-------------------------------------------------------------------------------------------------------------
+ Seq Scan on usa_house_prices_partitioned_1 usa_house_prices_partitioned (cost=0.00..63.60 rows=1 width=51)
+   Filter: (("Address")::text = '1 Infinite Loop, Cupertino, California'::text)
+```
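
The flip side is that a query which doesn't filter on the partition key gives Postgres no way to pick a single child table, so it has to scan all of them. Assuming the tables created above, you can see this by running `EXPLAIN` on a query that filters on a different column; the plan should show an `Append` over both partitions:

```sql
-- No "Address" in the WHERE clause: expect sequential scans of both
-- usa_house_prices_partitioned_1 and usa_house_prices_partitioned_2.
EXPLAIN SELECT "Address"
FROM usa_house_prices_partitioned
WHERE "Price" > 500000;
```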
+
+### Partitioning vectors
+
+When discussing [Vectors](vectors.md), we mentioned that HNSW indexes slow down table inserts as the table grows over time. Partitioning is a great tool to help us scale vector indexes used for ANN search.
+
+For this example, we'll be using a section of the [Amazon Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews) dataset that we've embedded using the `intfloat/e5-large` embeddings model. Our subset of the data contains 250,000 rows and two columns:
+
+| Column                      | Data type      | Example                                      |
+| --------------------------- | -------------- | -------------------------------------------- |
+| `review_body`               | `VARCHAR`      | `It was great`                               |
+| `review_embedding_e5_large` | `VECTOR(1024)` | `[-0.11999297,-1.5099727,-0.102814615, ...]` |
+
+You can [download](https://static.postgresml.org/datasets/amazon_reviews_with_embeddings.csv.gz) this dataset in CSV format from our CDN. To unzip it, install `pigz` and run:
+
+```bash
+unpigz amazon_reviews_with_embeddings.csv.gz
+```
+
+#### Creating partitions
+
+Let's get started by creating a partitioned table with three (3) child partitions. We'll be using hash partitioning on the `review_body` column, which should produce three (3) roughly equally sized tables.
+
+```sql
+CREATE TABLE amazon_reviews_with_embedding (
+    review_body TEXT,
+    review_embedding_e5_large VECTOR(1024)
+) PARTITION BY HASH(review_body);
+
+CREATE TABLE amazon_reviews_with_embedding_1
+PARTITION OF amazon_reviews_with_embedding
+FOR VALUES WITH (modulus 3, remainder 0);
+
+CREATE TABLE amazon_reviews_with_embedding_2
+PARTITION OF amazon_reviews_with_embedding
+FOR VALUES WITH (modulus 3, remainder 1);
+
+CREATE TABLE amazon_reviews_with_embedding_3
+PARTITION OF amazon_reviews_with_embedding
+FOR VALUES WITH (modulus 3, remainder 2);
+```
+
+This creates a total of four (4) tables: one parent table defining the schema and three (3) child tables that will contain the review text and the embedding vectors. To import data into the tables, you can use `COPY`:
+
+```
+postgresml=# \copy amazon_reviews_with_embedding FROM 'amazon_reviews_with_embeddings.csv' CSV HEADER;
+COPY 250000
+```
+
+#### Indexing vectors
+
+Now that we've split our 250,000 vectors into three (3) tables, we can create an HNSW index on each partition independently. This allows us to shard the index into three (3) equally sized parts, and because Postgres is a database server, we can do so in parallel.
+
+If you're doing this with `psql`, open up three (3) terminal tabs, connect to your PostgresML database and create an index on each partition separately:
+
+{% tabs %}
+{% tab title="Tab 1" %}
+```sql
+SET maintenance_work_mem TO '2GB';
+
+CREATE INDEX ON
+amazon_reviews_with_embedding_1
+USING hnsw(review_embedding_e5_large vector_cosine_ops);
+```
+{% endtab %}
+
+{% tab title="Tab 2" %}
+```sql
+SET maintenance_work_mem TO '2GB';
+
+CREATE INDEX ON
+amazon_reviews_with_embedding_2
+USING hnsw(review_embedding_e5_large vector_cosine_ops);
+```
+{% endtab %}
+
+{% tab title="Tab 3" %}
+```sql
+SET maintenance_work_mem TO '2GB';
+
+CREATE INDEX ON
+amazon_reviews_with_embedding_3
+USING hnsw(review_embedding_e5_large vector_cosine_ops);
+```
+{% endtab %}
+{% endtabs %}
+
+This is an example of scaling vector search using partitions. We are increasing our indexing speed 3x because we can create HNSW indexes on separate tables in parallel. Since we have separate indexes for each partition, we are also reducing the size of each HNSW index by 3x, making sure that `INSERT` queries against the data remain sufficiently quick.
+
+#### Partitioned vector search
+
+To perform an ANN search using the indexes we created, we don't have to do anything special. Postgres will automatically scan all three (3) indexes for the closest matches and combine them into one result:
+
+```sql
+SELECT
+    review_body,
+    review_embedding_e5_large <=> pgml.embed(
+        'intfloat/e5-large',
+        'this chair was amazing'
+    )::vector(1024) AS cosine_distance
+FROM amazon_reviews_with_embedding
+ORDER BY cosine_distance
+LIMIT 9;
+```
+
+```
+      review_body       |   cosine_distance
+------------------------+---------------------
+ It was great.          |  0.1514577011633712
+ It was great.          |  0.1514577011633712
+ It was great.          |  0.1514577011633712
+ It was great.          |  0.1514577011633712
+ It was great.          |  0.1514577011633712
+ It was great.          |  0.1514577011633712
+ amazing                | 0.17130070002153353
+ Amazing                | 0.17130070002153353
+ Absolutely phenomenal. |  0.1742546608547857
+```
+
+Since scanning HNSW indexes is very quick, we are okay with having to scan all the indexes we created for every query. As of this writing, `pgvector` doesn't support partitioning its indexes because this requires splitting the graph into distinct sections. Work on this front will continue, and we'll add support for sharding HNSW indexes in the future.
+
+To validate that Postgres is using the indexes, prepend `EXPLAIN` to the query. You should see three (3) index scans, one for each partition table.
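
For example, prepending `EXPLAIN` to the search query above looks like this; the exact plan output depends on your Postgres and `pgvector` versions, but each of the three partitions should appear with its own HNSW index scan:

```sql
-- Same ANN query as above, with EXPLAIN prepended to inspect the plan.
EXPLAIN SELECT
    review_body,
    review_embedding_e5_large <=> pgml.embed(
        'intfloat/e5-large',
        'this chair was amazing'
    )::vector(1024) AS cosine_distance
FROM amazon_reviews_with_embedding
ORDER BY cosine_distance
LIMIT 9;
```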
