
Commit c728d7e

committed

First version of vector databases

1 parent 42a6541, commit c728d7e

1 file changed (+67 -1 lines changed)

README.md

Original file line number	Diff line number	Diff line change
`@@ -773,7 +773,73 @@ SELECT pgml.transform(`
`773`	`773`	`# Vector Database`
`774`	`774`	A vector database is a type of database that stores and manages vectors, which are mathematical representations of data points in a multi-dimensional space. Vectors can be used to represent a wide range of data types, including images, text, audio, and numerical data. It is designed to support efficient searching and retrieval of vectors, using methods such as nearest neighbor search, clustering, and indexing. These methods enable applications to find vectors that are similar to a given query vector, which is useful for tasks such as image search, recommendation systems, and natural language processing.
`775`	`775`
`776`		-PostgresML allows you to use your existing PostgreSQL database as a vector database by generating embeddings from text stored in the database. To generate embeddings, you can use the `pgml.embed` function, which takes a transformer name and a text value as input. This function automatically downloads and caches the transformer for future reuse, which saves time and resources.
	`776`	+PostgresML lets you use your existing PostgreSQL database as a vector database by generating embeddings from text stored in your tables. To generate embeddings, use the `pgml.embed` function, which takes a transformer name and a text value as input. The function automatically downloads the transformer and caches it for reuse, which saves time and resources.
	`777`	`+`
	`778`	`+Using a vector database involves three key steps: creating embeddings, indexing your embeddings using different algorithms, and querying the index using embeddings for your queries. Let's break down each step in more detail.`
	`779`	`+`
	`780`	`+## Step 1: Creating embeddings using transformers`
	`781`	`+To create embeddings for your data, you first need to choose a transformer that can generate embeddings from your input data. Some popular transformer options include BERT, GPT-2, and T5. Once you've selected a transformer, you can use it to generate embeddings for your data.`
	`782`	`+`
	`783`	+In the following section, we will demonstrate how to use PostgresML to generate embeddings for a dataset of tweets commonly used in sentiment analysis. To generate the embeddings, we will use the `pgml.embed` function, which will generate an embedding for each tweet in the dataset. These embeddings will then be inserted into a table called `tweet_embeddings`.
	`784`	+```sql
	`785`	`+SELECT pgml.load_dataset('tweet_eval', 'sentiment');`
	`786`	`+`
	`787`	`+SELECT *`
	`788`	`+FROM pgml.tweet_eval`
	`789`	`+LIMIT 10;`
	`790`	`+`
	`791`	`+CREATE TABLE tweet_embeddings AS`
	`792`	`+SELECT text, pgml.embed('distilbert-base-uncased', text) AS embedding`
	`793`	`+FROM pgml.tweet_eval;`
	`794`	`+`
	`795`	`+SELECT * FROM tweet_embeddings LIMIT 2;`
	`796`	+```
	`797`	`+`
	`798`	`+Result`
	`799`	`+`
	`800`	`+|text|embedding|`
	`801`	`+|----|---------|`
	`802`	`+|"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"|{-0.1567948312,-0.3149209619,0.2163394839,..}|`
	`803`	`+|"Ben Smith / Smith (concussion) remains out of the lineup Thursday, Curtis #NHL #SJ"|{-0.0701668188,-0.012231146,0.1304316372,..}|`
	`804`	`+`
	`805`	`+`
	`806`	`+## Step 2: Indexing your embeddings using different algorithms`
	`807`	+After you've created embeddings for your data, you need to index them using one or more indexing algorithms. Several types are available, including B-trees, exact k-nearest neighbor (KNN) search, and approximate nearest neighbor (ANN) indexes. The right choice depends on your use case and performance requirements. For example, B-trees are a good choice for range queries on scalar columns, while KNN and ANN methods are better suited to similarity searches.
	`808`	`+`
	`809`	`+On small datasets (<100k rows), a linear search that compares every row to the query will give sub-second results, which may be fast enough for your use case. For larger datasets, you may want to consider various indexing strategies offered by additional extensions.`
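
As a rough illustration of the linear search described above, the following sketch scores every row of `tweet_embeddings` against a query embedding using cosine similarity computed in plain SQL. It assumes the `embedding` column is still the array produced by `pgml.embed` in Step 1; the aliases `t`, `q`, and `u` are illustrative names only.

```sql
-- Linear (sequential) scan: score every stored embedding against the query.
-- Assumes tweet_embeddings.embedding is an array column, as created in Step 1.
WITH query AS (
    SELECT pgml.embed('distilbert-base-uncased',
                      'Star Wars christmas special is on Disney') AS embedding
)
SELECT t.text,
       (
           -- Cosine similarity = dot(a, b) / (|a| * |b|), computed element-wise.
           SELECT sum(u.a * u.b) / (sqrt(sum(u.a * u.a)) * sqrt(sum(u.b * u.b)))
           FROM unnest(t.embedding, q.embedding) AS u(a, b)
       ) AS cosine_similarity
FROM tweet_embeddings AS t
CROSS JOIN query AS q
ORDER BY cosine_similarity DESC
LIMIT 5;
```

Because no index is involved, the cost grows with the number of rows, which is why this approach is only suggested for small tables.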
	`810`	`+`
	`811`	`+- <a href="https://www.postgresql.org/docs/current/cube.html" target="_blank">Cube</a> is a built-in extension that provides a fast indexing strategy for finding similar vectors. By default it has an arbitrary limit of 100 dimensions, unless Postgres is compiled with a larger size.`
	`812`	`+- <a href="https://github.com/pgvector/pgvector" target="_blank">PgVector</a> supports embeddings up to 2000 dimensions out of the box, and provides a fast indexing strategy for finding similar vectors.`
	`813`	`+`
	`814`	`+When indexing your embeddings, it's important to consider the trade-offs between accuracy and speed. Exact methods, such as a full KNN scan, return precise results but may be slower than approximate (ANN) indexes, which trade some accuracy for speed. Some indexing algorithms also require more memory or disk space than others.`
	`815`	`+`
	`816`	`+The following example creates an index on the tweet_embeddings table using the ivfflat algorithm. ivfflat is a hybrid index that combines an Inverted File (IVF) index with a Flat (FLAT) index.`
	`817`	`+`
	`818`	+The index is created on the `embedding` column of the `tweet_embeddings` table, which contains the vector embeddings generated from the original tweet dataset. The `vector_cosine_ops` argument specifies the operator class to use for the embeddings. In this case, it selects cosine distance, a common measure of similarity between vectors.
	`819`	`+`
	`820`	`+By creating an index on the embedding column, the database can quickly search for and retrieve records that are similar to a given query vector. This can be useful for a variety of machine learning applications, such as similarity search or recommendation systems.`
	`821`	`+`
	`822`	+```sql
	`823`	`+CREATE INDEX ON tweet_embeddings USING ivfflat (embedding vector_cosine_ops);`
	`824`	+```
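
The `ivfflat` index type comes from the pgvector extension and applies to columns of its `vector` type. Below is a minimal sketch of the surrounding setup, offered as an alternative to the index statement above rather than an addition to it. It assumes that pgvector is installed, that the column still holds the array produced in Step 1, and that `distilbert-base-uncased` yields 768-dimensional embeddings. The `lists` value is an illustrative choice, not a recommendation from the source.

```sql
-- Enable pgvector (assumed to be installed on the server).
CREATE EXTENSION IF NOT EXISTS vector;

-- Convert the array column to pgvector's fixed-dimension vector type.
-- 768 is the embedding size of distilbert-base-uncased.
ALTER TABLE tweet_embeddings
    ALTER COLUMN embedding TYPE vector(768)
    USING embedding::vector(768);

-- IVFFlat index with an explicit number of lists (clusters).
-- More lists means fewer vectors scanned per probed list; 100 is illustrative.
CREATE INDEX ON tweet_embeddings
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
```

An IVFFlat index is best built after the table has been populated, since the list centers are derived from the rows present at build time.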
	`825`	`+## Step 3: Querying the index using embeddings for your queries`
	`826`	`+Once your embeddings have been indexed, you can use them to perform queries against your database. To do this, you'll need to provide a query embedding that represents the query you want to perform. The index will then return the closest matching embeddings from your database, based on the similarity between the query embedding and the stored embeddings.`
	`827`	`+`
	`828`	+```sql
	`829`	`+WITH query AS (`
	`830`	`+    SELECT pgml.embed('distilbert-base-uncased', 'Star Wars christmas special is on Disney')::vector AS embedding`
	`831`	`+)`
	`832`	`+SELECT * FROM tweet_embeddings, query ORDER BY tweet_embeddings.embedding <-> query.embedding LIMIT 5;`
	`833`	+```
	`834`	`+`
	`835`	`+Result`
	`836`	`+|text|`
	`837`	`+|----|`
	`838`	`+|Happy Friday with Batman animated Series 90S forever!|`
	`839`	`+|"Fri Oct 17, Sonic Highways is on HBO tonight, Also new episode of Girl Meets World on Disney"|`
	`840`	`+|tfw the 2nd The Hunger Games movie is on Amazon Prime but not the 1st one I didn't watch|`
	`841`	`+|5 RT's if you want the next episode of twilight princess tomorrow|`
	`842`	`+|Jurassic Park is BACK! New Trailer for the 4th Movie, Jurassic World -|`
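
pgvector can only use an index when the query's distance operator matches the operator class the index was built with. The query above orders by `<->` (Euclidean distance), while the index from Step 2 uses `vector_cosine_ops`, which pairs with `<=>` (cosine distance). Below is a hedged variant that lines the two up. It assumes the `embedding` column has the `vector` type, and the `probes` value is an illustrative setting. The query embedding is written as a scalar subquery so the planner can treat it as a constant.

```sql
-- Probe more IVF lists per query for better recall (pgvector's default is 1).
SET ivfflat.probes = 10;

SELECT t.text
FROM tweet_embeddings AS t
-- <=> is pgvector's cosine distance operator, matching vector_cosine_ops.
ORDER BY t.embedding <=> (
    SELECT pgml.embed('distilbert-base-uncased',
                      'Star Wars christmas special is on Disney')::vector
)
LIMIT 5;
```

Because the ranking metric changes from Euclidean to cosine distance, the rows returned may differ from the result table shown above.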
`777`	`843`
`778`	`844`
`779`	`845`	`<!-- ## Sentence Similarity`

0 commit comments
