Commit a38c1a9

Lev Kokotov authored and gitbook-bot committed

GITBOOK-66: Edits to tabular data, vectors and partitioning vectors

1 parent c10bec2 · commit a38c1a9

File tree: 3 files changed, +224 −34 lines changed

‎pgml-docs/docs/guides/data-storage-and-retrieval/partitioning.md

Lines changed: 201 additions & 11 deletions
@@ -1,19 +1,19 @@
# Partitioning

-Partitioning is the act of splitting Postgres tables into multiple smaller tables, which allows to query each smaller table independently. This is useful and sometimes necessary when tables get so large that querying a single table becomes too slow. Partitioning requires detailed knowledge of the dataset and uses that knowledge to help Postgres execute faster queries.
+Partitioning is splitting Postgres tables into multiple smaller tables, with the intention of querying each smaller table independently. This is useful and sometimes necessary when tables get so large that accessing them becomes too slow. Partitioning requires detailed knowledge of the dataset and uses that knowledge to help Postgres execute queries faster.

### Partitioning schemes

-Postgres supports three (3) kinds of partitioning schemes: by range, by list, and by hash. Each scheme is appropriate for different use cases, and choosing the right scheme is important to get the best performance out of your data.
+Postgres supports three (3) kinds of partition schemes: by range, by hash, and by list. Each scheme is appropriate for different use cases, and choosing the right scheme is important to get the best performance out of your data.

### Partition by range

Partition by range operates on numerical values. Dates, numbers and vectors can be used as range partition keys because their range of values can be split into non-overlapping parts.

-For example, if we have a table with a date column (`TIMESTAMPTZ`, a date and time with timezone information), we can create three (3) partitions with the following bounds:
+For example, if we have a table with a date column (`TIMESTAMPTZ`, a date and time with timezone information), we could create three (3) partitions with the following bounds:

* partition 1 will contain all dates prior to January 1, 2000,
-* partition 2 will contain all dates between January 1, 2000 and December 31, 2020,
+* partition 2 will contain all dates between January 1, 2000 and January 1, 2021,
* partition 3 will contain all dates after January 1, 2021.

While these ranges are not even, we chose them because of some knowledge we have about our dataset. In our hypothetical example, we know that these date ranges will split our dataset into roughly three (3) evenly sized tables.
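
For reference, here is a minimal sketch of what those three bounds could look like in Postgres syntax. The table name `measurements` and column `created_at` are hypothetical, chosen only for illustration; the actual table for the energy dataset is created further down:

```sql
-- A sketch of the three date-range partitions described above.
-- "measurements" and "created_at" are illustrative names, not part of the dataset.
CREATE TABLE measurements (
    created_at TIMESTAMPTZ NOT NULL,
    value REAL NOT NULL
) PARTITION BY RANGE (created_at);

-- Each child table owns a non-overlapping range: inclusive FROM, exclusive TO.
CREATE TABLE measurements_before_2000
    PARTITION OF measurements
    FOR VALUES FROM (MINVALUE) TO ('2000-01-01');

CREATE TABLE measurements_2000_2020
    PARTITION OF measurements
    FOR VALUES FROM ('2000-01-01') TO ('2021-01-01');

CREATE TABLE measurements_2021_onward
    PARTITION OF measurements
    FOR VALUES FROM ('2021-01-01') TO (MAXVALUE);
```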
@@ -22,7 +22,7 @@ While these ranges are not even, we chose them because of some knowledge we have

Let's build some real partitions with a dataset from Kaggle: [Hourly Energy Consumption](https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption).

-You can create a partition by range in Postgres with just a few queries. Partitioning requires two types of tables: the parent table which defines the partitioning scheme, and the child tables which define the ranges and store the actual data.
+In Postgres, you can create a partition by range with just a few queries. Partitioning requires creating two types of tables: the parent table which defines the partitioning scheme, and the child tables which define the ranges and store the actual data.

Let's start with the parent table:

@@ -68,11 +68,11 @@ postgresml=# SELECT count(*) FROM energy_consumption_2004_2011;
57762
```

-Nicely done. The two tables are pretty close to each other which creates a roughly even distribution of data in our partitioning scheme.
+Nicely done. The two table counts are pretty close, which creates a roughly even distribution of data in our partitioning scheme.

-Postgres allows to query each partition individually, which is nice if we know what the range specification is. While this works in this example, in a living dataset, we could continue to add partitions to include more values. If we wanted to store dates for the years 2019 through 2023, for example, we would need to make at least one more child table.
+Postgres allows us to query each partition individually, which is nice if we know what the range specification is. While this works in this example, in a living dataset, we would need to continue adding partitions to include more values. If we wanted to store dates for the years 2019 through 2023, for example, we would need to make at least one more child table.

-To make this user friendly, Postgres allows us to query the parent table instead. As long as we specify the partition key, we are guaranteed to get the most efficient query plan possible:
+To make reading this data user-friendly, Postgres allows us to query the parent table instead. As long as we specify the partition key, we are guaranteed to get the most efficient query plan possible:

```sql
SELECT
@@ -87,7 +87,7 @@ WHERE "Datetime" BETWEEN '2004-01-01' AND '2005-01-01';
15175.689170820118
```

-If we look at the query plan, we'll see that Postgres only queries the first child table we created:
+If we look at the query plan, we'll see that Postgres only queries one of the child tables we created:

```
postgresml=# EXPLAIN SELECT
@@ -102,8 +102,198 @@ WHERE "Datetime" BETWEEN '2004-01-01' AND '2005-01-01';
   Filter: [...]
```

-which reduces the number of rows it has to scan by half. By adding more smaller partitions, we can significantly reduce the amount of data Postgres needs to scan to execute a query. That being said, scanning multiple tables could be more expensive than scanning just one table, so adding too many partitions quickly reduces its benefits if the queries need to scan more than a few child tables.
+This reduces the number of rows Postgres has to scan by half. By adding more partitions, we can significantly reduce the amount of data Postgres needs to scan to perform a query.

### Partition by hash

-Partitioning by hash, unlike by range, can be applied to any data type, including text. A hash function is applied to the partition key to create a reasonably unique number, and that number is then divided by the number of partitions to find the right child table for the row.
+Partitioning by hash, unlike by range, can be applied to any data type, including text. A hash function is executed on the partition key to create a reasonably unique number, and that number is then divided by the number of partitions to find the right child table for the row.
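
To see the routing logic, you can approximate it in plain SQL. The sketch below uses the built-in `hashtext()` function purely for illustration; the hash function Postgres actually uses to route rows to partitions is a different one, but the modulo arithmetic is the same:

```sql
-- Illustration only: hash a key and take the remainder to pick one of
-- two partitions. Postgres's real partition hash differs from hashtext().
SELECT
    abs(hashtext('1 Infinite Loop, Cupertino, California')) % 2
        AS partition_number;
```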
+
+To create a table partitioned by hash, the syntax is similar to partition by range. Let's use the USA House Prices dataset we used in [Vectors](vectors.md) and [Tabular data](tabular-data.md), and split that table into two (2) roughly equal parts. Since we already have the `usa_house_prices` table, let's create a new one with the same columns, except this one will be partitioned:
+
+```sql
+CREATE TABLE usa_house_prices_partitioned (
+    "Avg. Area Income" REAL NOT NULL,
+    "Avg. Area House Age" REAL NOT NULL,
+    "Avg. Area Number of Rooms" REAL NOT NULL,
+    "Avg. Area Number of Bedrooms" REAL NOT NULL,
+    "Area Population" REAL NOT NULL,
+    "Price" REAL NOT NULL,
+    "Address" VARCHAR NOT NULL
+) PARTITION BY HASH("Address");
+```
+
+Let's add two (2) partitions by hash. Hashing uses modulo arithmetic; when creating a child table with this scheme, you need to specify the denominator and the remainder:
+
+```sql
+CREATE TABLE usa_house_prices_partitioned_1
+PARTITION OF usa_house_prices_partitioned
+FOR VALUES WITH (modulus 2, remainder 0);
+
+CREATE TABLE usa_house_prices_partitioned_2
+PARTITION OF usa_house_prices_partitioned
+FOR VALUES WITH (modulus 2, remainder 1);
+```
+
+Importing data into the new table can be done with just one query:
+
+```sql
+INSERT INTO usa_house_prices_partitioned
+SELECT * FROM usa_house_prices;
+```
+
+```
+INSERT 0 5000
+```
+
+Let's validate that our partitioning scheme worked:
+
+```
+postgresml=# SELECT count(*) FROM usa_house_prices_partitioned_1;
+ count
+-------
+  2528
+(1 row)
+
+postgresml=# SELECT count(*) FROM usa_house_prices_partitioned_2;
+ count
+-------
+  2472
+(1 row)
+```
+
+Great! As expected, hashing split our dataset into roughly equal parts. To take advantage of this when reading data, you need to specify the partition key "Address" in every query. Postgres will hash the key using the same hashing function and query the child table that can contain the row with that "Address" value:
+
+```
+postgresml=# EXPLAIN SELECT
+    "Avg. Area House Age",
+    "Address"
+FROM usa_house_prices_partitioned
+WHERE "Address" = '1 Infinite Loop, Cupertino, California';
+                                                 QUERY PLAN
+-------------------------------------------------------------------------------------------------------------
+ Seq Scan on usa_house_prices_partitioned_1 usa_house_prices_partitioned (cost=0.00..63.60 rows=1 width=51)
+   Filter: (("Address")::text = '1 Infinite Loop, Cupertino, California'::text)
+```
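
The flip side is that a query which doesn't filter on the partition key gives Postgres no way to pick a single child table, so it has to scan all of them. Assuming the tables created above, you can see this by running `EXPLAIN` on a query that filters on a different column; the plan should show an `Append` over both partitions:

```sql
-- No "Address" in the WHERE clause: expect sequential scans of both
-- usa_house_prices_partitioned_1 and usa_house_prices_partitioned_2.
EXPLAIN SELECT "Address"
FROM usa_house_prices_partitioned
WHERE "Price" > 500000;
```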
+
+### Partitioning vectors
+
+When discussing [Vectors](vectors.md), we mentioned that HNSW indexes slow down table inserts as the table grows over time. Partitioning is a great tool to help us scale vector indexes used for ANN search.
+
+For this example, we'll be using a section of the [Amazon Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews) dataset that we've embedded using the `intfloat/e5-large` embeddings model. Our subset of the data contains 250,000 rows and two columns:
+
+| Column                      | Data type      | Example                                      |
+| --------------------------- | -------------- | -------------------------------------------- |
+| `review_body`               | `VARCHAR`      | `It was great`                               |
+| `review_embedding_e5_large` | `VECTOR(1024)` | `[-0.11999297,-1.5099727,-0.102814615, ...]` |
+
+You can [download](https://static.postgresml.org/datasets/amazon_reviews_with_embeddings.csv.gz) this dataset in CSV format from our CDN. To unzip it, install `pigz` and run:
+
+```bash
+unpigz amazon_reviews_with_embeddings.csv.gz
+```
+
+#### Creating partitions
+
+Let's get started by creating a partitioned table with three (3) child partitions. We'll be using hash partitioning on the `review_body` column, which should produce three (3) roughly equally sized tables.
+
+```sql
+CREATE TABLE amazon_reviews_with_embedding (
+    review_body TEXT,
+    review_embedding_e5_large VECTOR(1024)
+) PARTITION BY HASH(review_body);
+
+CREATE TABLE amazon_reviews_with_embedding_1
+PARTITION OF amazon_reviews_with_embedding
+FOR VALUES WITH (modulus 3, remainder 0);
+
+CREATE TABLE amazon_reviews_with_embedding_2
+PARTITION OF amazon_reviews_with_embedding
+FOR VALUES WITH (modulus 3, remainder 1);
+
+CREATE TABLE amazon_reviews_with_embedding_3
+PARTITION OF amazon_reviews_with_embedding
+FOR VALUES WITH (modulus 3, remainder 2);
+```
+
+This creates a total of four (4) tables: one parent table defining the schema and three (3) child tables that will contain the review text and the embedding vectors. To import data into the tables, you can use `COPY`:
+
+```
+postgresml=# \copy amazon_reviews_with_embedding FROM 'amazon_reviews_with_embeddings.csv' CSV HEADER;
+COPY 250000
+```
+
+#### Indexing vectors
+
+Now that we've split our 250,000 vectors into three (3) tables, we can create an HNSW index on each partition independently. This allows us to shard the index into three (3) equally sized parts, and because Postgres is a database server, we can do so in parallel.
+
+If you're doing this with `psql`, open up three (3) terminal tabs, connect to your PostgresML database and create an index on each partition separately:
+
+{% tabs %}
+{% tab title="Tab 1" %}
+```sql
+SET maintenance_work_mem TO '2GB';
+
+CREATE INDEX ON
+amazon_reviews_with_embedding_1
+USING hnsw(review_embedding_e5_large vector_cosine_ops);
+```
+{% endtab %}
+
+{% tab title="Tab 2" %}
+```sql
+SET maintenance_work_mem TO '2GB';
+
+CREATE INDEX ON
+amazon_reviews_with_embedding_2
+USING hnsw(review_embedding_e5_large vector_cosine_ops);
+```
+{% endtab %}
+
+{% tab title="Tab 3" %}
+```sql
+SET maintenance_work_mem TO '2GB';
+
+CREATE INDEX ON
+amazon_reviews_with_embedding_3
+USING hnsw(review_embedding_e5_large vector_cosine_ops);
+```
+{% endtab %}
+{% endtabs %}
+
+This is an example of scaling vector search using partitions. We are increasing our indexing speed 3x because we can create HNSW indexes on separate tables in parallel. Since we have separate indexes for each partition, we are also reducing the size of each HNSW index by 3x, making sure that `INSERT` queries against the data remain sufficiently quick.
+
+#### Partitioned vector search
+
+To perform an ANN search using the indexes we created, we don't have to do anything special. Postgres will automatically scan all three (3) indexes for the closest matches and combine them into one result:
+
+```sql
+SELECT
+    review_body,
+    review_embedding_e5_large <=> pgml.embed(
+        'intfloat/e5-large',
+        'this chair was amazing'
+    )::vector(1024) AS cosine_distance
+FROM amazon_reviews_with_embedding
+ORDER BY cosine_distance
+LIMIT 9;
+```
+
+```
+      review_body       |   cosine_distance
+------------------------+---------------------
+ It was great.          |  0.1514577011633712
+ It was great.          |  0.1514577011633712
+ It was great.          |  0.1514577011633712
+ It was great.          |  0.1514577011633712
+ It was great.          |  0.1514577011633712
+ It was great.          |  0.1514577011633712
+ amazing                | 0.17130070002153353
+ Amazing                | 0.17130070002153353
+ Absolutely phenomenal. |  0.1742546608547857
+```
+
+Since scanning HNSW indexes is very quick, we are okay with having to scan all the indexes we created for every query. As of this writing, `pgvector` doesn't support partitioning its indexes because this requires splitting the graph into distinct sections. Work on this front will continue, and we'll add support for sharding HNSW indexes in the future.
+
+To validate that Postgres is using the indexes, prepend `EXPLAIN` to the query. You should see three (3) index scans, one for each partition table.
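
For example, prepending `EXPLAIN` to the search query above looks like this; the exact plan output depends on your Postgres and `pgvector` versions, but each of the three partitions should appear with its own HNSW index scan:

```sql
-- Same ANN query as above, with EXPLAIN prepended to inspect the plan.
EXPLAIN SELECT
    review_body,
    review_embedding_e5_large <=> pgml.embed(
        'intfloat/e5-large',
        'this chair was amazing'
    )::vector(1024) AS cosine_distance
FROM amazon_reviews_with_embedding
ORDER BY cosine_distance
LIMIT 9;
```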
