# Tabular data
Tabular data is data stored in tables. A table is a format that defines rows and columns, and is the most common way to organize data. Examples of tabular data are spreadsheets, database tables, CSV files, and Pandas dataframes.
Storing and accessing tabular data efficiently has been the subject of decades of research, and is the core purpose of most database systems. PostgreSQL has been leading the charge on optimal tabular storage for a long time, and it remains one of the most popular and effective ways to store, organize, and retrieve tabular data today.
### Creating tables
Postgres makes it really easy to create and use tables. If you're looking to use PostgresML for a supervised learning project, creating a table will be very similar to a Pandas dataframe, except it will be durable and easily accessible for as long as the database exists.
For the rest of this guide, we'll take the [USA House Prices](https://www.kaggle.com/code/fatmakursun/supervised-unsupervised-learning-examples/) dataset from Kaggle, store it in a Postgres table, and run some basic queries. The dataset has seven (7) columns and 5,000 rows:

| Column | Data type | Postgres data type |
|--------|-----------|---------------------|
| Avg. Area Income | Float | REAL |
| Avg. Area House Age | Float | REAL |
| Avg. Area Number of Rooms | Float | REAL |
| Avg. Area Number of Bedrooms | Float | REAL |
| Area Population | Float | REAL |
| Price | Float | REAL |
| Address | String | VARCHAR |

Once we know the column names and data types, the Postgres table definition is pretty straightforward:

```plsql
CREATE TABLE usa_house_prices (
    "Avg. Area Income" REAL NOT NULL,
    "Avg. Area House Age" REAL NOT NULL,
    "Avg. Area Number of Rooms" REAL NOT NULL,
    "Avg. Area Number of Bedrooms" REAL NOT NULL,
    "Area Population" REAL NOT NULL,
    "Price" REAL NOT NULL,
    "Address" VARCHAR NOT NULL
);
```
The column names are double quoted because they contain special characters like `.` and space, which can be interpreted as part of the SQL syntax. Generally speaking, it's good practice to double quote all entity names when using them in a query, although most of the time it's not needed.
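
For example, unquoted identifiers are folded to lower case, which is why the quotes matter here:

```sql
-- Unquoted identifiers are folded to lower case, so this looks for
-- a column named "address", which doesn't exist, and fails:
SELECT address FROM usa_house_prices;

-- Quoting preserves the exact name:
SELECT "Address" FROM usa_house_prices;
```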
If you run this using `psql`, you'll get something like this:
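
```
CREATE TABLE
```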

### Ingesting data
Right now the table is empty, which is a bit boring. Let's import the USA House Prices dataset using one of the fastest ways to do so in Postgres: `COPY`.

If you're like me and prefer to use the terminal, you can open up `psql` and ingest the data like this:

```
postgresml=# \copy usa_house_prices FROM 'USA_Housing.csv' CSV HEADER;
COPY 5000
```

As expected, Postgres copied all 5,000 rows into the `usa_house_prices` table. `COPY` accepts CSV, text, and Postgres binary formats, but CSV is definitely the most common.
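
The format is selected with the `FORMAT` option; the server-side file paths below are illustrative:

```sql
-- Each format expects a file written in that format:
COPY usa_house_prices FROM '/tmp/USA_Housing.csv' WITH (FORMAT csv, HEADER true);
COPY usa_house_prices FROM '/tmp/USA_Housing.txt' WITH (FORMAT text);
COPY usa_house_prices FROM '/tmp/USA_Housing.bin' WITH (FORMAT binary);
```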
You may have noticed that we used the `\copy` command in the terminal, not `COPY`. The `COPY` command comes in two forms: `\copy` is a `psql` command that copies local files to a remote database, while `COPY` is more commonly used in applications to send data from other sources, like standard input, files, other databases, and streams.

If you're writing your own application to ingest large amounts of data into Postgres, you should use `COPY` for maximum throughput.
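
As a sketch of the difference (the server-side path is illustrative and must be readable by the Postgres server process):

```sql
-- COPY reads a file on the database server itself:
COPY usa_house_prices FROM '/var/lib/postgresql/USA_Housing.csv' CSV HEADER;

-- COPY FROM STDIN streams rows over the connection, which is the form
-- client libraries use when sending data from the application side:
COPY usa_house_prices FROM STDIN CSV HEADER;
```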

### Querying data
Querying data stored in tables is what makes PostgresML so powerful. Postgres has one of the most comprehensive querying languages of all the databases we've worked with, so for our example, we won't have any trouble calculating some statistics.
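
For example, here's one way to summarize the "Avg. Area Income" column (the aggregates chosen are just an illustration):

```sql
SELECT
    count(*) AS row_count,
    avg("Avg. Area Income") AS mean,
    min("Avg. Area Income") AS min,
    max("Avg. Area Income") AS max,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY "Avg. Area Income") AS median
FROM usa_house_prices;
```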

The SQL language is expressive and allows you to select, filter, and aggregate any number of columns with a single query.
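
For instance, a single query can filter and group at the same time (the "Avg. Area Number of Bedrooms" and "Area Population" columns come from the same dataset):

```sql
-- Average price by bedroom count, restricted to larger areas
SELECT
    "Avg. Area Number of Bedrooms" AS bedrooms,
    count(*) AS listings,
    avg("Price") AS average_price
FROM usa_house_prices
WHERE "Area Population" > 30000
GROUP BY 1
ORDER BY 1;
```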

### Adding more data
Because databases store data permanently, adding more data to Postgres can be done in many ways. The simplest and most commonly used way is to just insert it into a table you already have. Using the same example dataset, we can add a new row with just one query:

```sql
INSERT INTO usa_house_prices (
    "Avg. Area Income",
    "Avg. Area House Age",
    "Avg. Area Number of Rooms",
    "Avg. Area Number of Bedrooms",
    "Area Population",
    "Price",
    "Address"
) VALUES (
    199778.0,
    43.0,
    3.0,
    2.0,
    57856.0,
    5000000.0,
    '1 Infinite Loop, Cupertino, California'
);
```

If you have more CSV files you'd like to ingest, you can run `COPY` for each one. Many ETL pipelines from Snowflake or Redshift chunk their output into multiple CSVs, which can be individually imported into Postgres using `COPY`:

{% tabs %}
{% tab title="Python" %}
```python
import psycopg
from glob import glob

with psycopg.connect("postgres:///postgresml") as conn:
    cur = conn.cursor()

    with cur.copy("COPY usa_house_prices FROM STDIN CSV") as copy:
        for csv_file in glob("*.csv"):
            with open(csv_file) as f:
                next(f)  # Skip the header row
                for line in f:
                    copy.write(line)
```
{% endtab %}
{% tab title="Bash" %}
```bash
#!/bin/bash

for f in *.csv; do
    psql postgres:///postgresml \
-c"\copy usa_house_prices FROM '$f' CSV HEADER"
done
```
{% endtab %}
{% endtabs %}
Now that our dataset is changing, we should explore some tools to protect it against bad values.

### Data integrity
Databases store important data, so they were built with many safety features to protect it from common errors. In machine learning, one of the most common errors is data duplication, i.e. having the same row appear in a table twice. Postgres can protect us against duplication with unique indexes.

Looking at the USA House Prices dataset, we can find its natural key pretty easily. Since most columns are aggregates, the only column that seems like it should contain unique values is the "Address". There should never be more than one house for sale at a single address.

To ensure that our table reflects this, let's add a unique index:

```sql
CREATE UNIQUE INDEX ON usa_house_prices USING btree("Address");
```

When creating a unique index, Postgres scans the whole table, ensures there are no duplicates in the "Address" column, and writes the column into an index using the B-Tree algorithm.
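
Note that if the table already contains duplicates, creating the unique index will fail. A quick way to check for them beforehand:

```sql
SELECT "Address", count(*)
FROM usa_house_prices
GROUP BY "Address"
HAVING count(*) > 1;
```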

If we attempt to insert the same row again, we'll get an error:

```
ERROR: duplicate key value violates unique constraint "usa_house_prices_Address_idx"
```
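
If duplicates are expected in incoming data, you can tell Postgres how to resolve the conflict instead of erroring out. A sketch using `ON CONFLICT`, reusing the row from above:

```sql
INSERT INTO usa_house_prices (
    "Avg. Area Income",
    "Avg. Area House Age",
    "Avg. Area Number of Rooms",
    "Avg. Area Number of Bedrooms",
    "Area Population",
    "Price",
    "Address"
) VALUES (
    199778.0,
    43.0,
    3.0,
    2.0,
    57856.0,
    5000000.0,
    '1 Infinite Loop, Cupertino, California'
)
ON CONFLICT ("Address") DO UPDATE
SET "Price" = EXCLUDED."Price";
```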

Postgres supports many more indexing algorithms, e.g. GiST, BRIN, GIN, and Hash. Many extensions, e.g. `pgvector`, implement their own index types like HNSW and IVFFlat, which help efficiently retrieve specialized values. We explore those in our guide about [Vectors](vectors.md).
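
As a sketch of the syntax (the column choices here are purely illustrative):

```sql
-- BRIN indexes are very small and suit naturally ordered data:
CREATE INDEX ON usa_house_prices USING brin("Area Population");

-- Hash indexes support only equality comparisons:
CREATE INDEX ON usa_house_prices USING hash("Address");
```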

### Accelerating recall
Once the dataset gets large enough, and we're talking millions of rows, it's no longer practical to query the table directly. The amount of data Postgres has to scan becomes large and queries become slow. To help with that, tables should have indexes that order and organize commonly read columns. Searching a B-Tree index can be done in _O(log n)_ time, which is orders of magnitude faster than the _O(n)_ full table scan.
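
For example, if queries frequently filter by price, a B-Tree index on that column keeps them fast as the table grows:

```sql
CREATE INDEX ON usa_house_prices USING btree("Price");

-- Range filters like this one can now use the index
-- instead of scanning the entire table:
SELECT count(*) FROM usa_house_prices
WHERE "Price" BETWEEN 500000 AND 1000000;
```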

#### Querying an index
Postgres automatically uses indexes when it's possible and optimal to do so. From our example, we can filter the dataset by the "Address" column, and we can do so very quickly because of the index we created:

```sql
SELECT
    "Address"
FROM usa_house_prices
WHERE "Address" = '1 Infinite Loop, Cupertino, California';
```

which produces

```
                 Address
-----------------------------------------
 1 Infinite Loop, Cupertino, California
(1 row)
```
Since we have a unique index on the table, we will only see one row with that address.

To double-check that Postgres is using an index, we can look at the query execution plan. A query plan is the list of steps Postgres will take to compute the result we requested. To see it, prepend the keyword `EXPLAIN` to the query you're running; the output will look something like this (exact costs will vary):
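
```
postgresml=# EXPLAIN SELECT
    "Address"
FROM usa_house_prices
WHERE "Address" = '1 Infinite Loop, Cupertino, California';

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Index Scan using "usa_house_prices_Address_idx" on usa_house_prices  (cost=0.28..8.30 rows=1 width=45)
   Index Cond: (("Address")::text = '1 Infinite Loop, Cupertino, California'::text)
```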

The plan indicates it will use an "Index Scan" on the index `usa_house_prices_Address_idx`, which is what we're expecting.

Using `EXPLAIN` doesn't actually run the query, so it's safe to use on production systems.
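
If you do want real execution timings, `EXPLAIN ANALYZE` executes the query and reports measured numbers, so treat it with more care, especially with statements that modify data:

```sql
EXPLAIN ANALYZE SELECT "Address"
FROM usa_house_prices
WHERE "Address" = '1 Infinite Loop, Cupertino, California';
```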
The ability to create indexes on datasets of any size, and to then efficiently query that data with those indexes, is what separates Postgres from most ad-hoc tools like Pandas and Arrow. Postgres can store and query data that would never fit in memory, and it can do that quicker and more efficiently than most other database systems used in the industry.

#### Maintaining an index
Postgres indexes require no special maintenance. They are automatically updated when data is added or removed. Postgres also ensures that indexes are efficiently organized and ACID compliant. The database guarantees that the data is always consistent, no matter how many concurrent changes are made.