You signed in with another tab or window.Reload to refresh your session.You signed out in another tab or window.Reload to refresh your session.You switched accounts on another tab or window.Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+67-1Lines changed: 67 additions & 1 deletion
Original file line number
Diff line number
Diff line change
@@ -773,7 +773,73 @@ SELECT pgml.transform(
773
773
#Vector Database
774
774
A vector database is a type of database that stores and manages vectors, which are mathematical representations of data points in a multi-dimensional space. Vectors can be used to represent a wide range of data types, including images, text, audio, and numerical data. It is designed to support efficient searching and retrieval of vectors, using methods such as nearest neighbor search, clustering, and indexing. These methods enable applications to find vectors that are similar to a given query vector, which is useful for tasks such as image search, recommendation systems, and natural language processing.
775
775
776
-
PostgresML allows you to use your existing PostgreSQL database as a vector database by generating embeddings from text stored in the database. To generate embeddings, you can use the`pgml.embed` function, which takes a transformer name and a text value as input. This function automatically downloads and caches the transformer for future reuse, which saves time and resources.
776
+
PostgresML enhances your existing PostgreSQL database to be used as a vector database by generating embeddings from text stored in your tables. To generate embeddings, you can use the`pgml.embed` function, which takes a transformer name and a text value as input. This function automatically downloads and caches the transformer for future reuse, which saves time and resources.
777
+
778
+
Using a vector database involves three key steps: creating embeddings, indexing your embeddings using different algorithms, and querying the index using embeddings for your queries. Let's break down each step in more detail.
779
+
780
+
##Step 1: Creating embeddings using transformers
781
+
To create embeddings for your data, you first need to choose a transformer that can generate embeddings from your input data. Some popular transformer options include BERT, GPT-2, and T5. Once you've selected a transformer, you can use it to generate embeddings for your data.
782
+
783
+
In the following section, we will demonstrate how to use PostgresML to generate embeddings for a dataset of tweets commonly used in sentiment analysis. To generate the embeddings, we will use the`pgml.embed` function, which will generate an embedding for each tweet in the dataset. These embeddings will then be inserted into a table called tweet_embeddings.
|"QT@user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"|{-0.1567948312,-0.3149209619,0.2163394839,..}|
803
+
|"Ben Smith / Smith (concussion) remains out of the lineup Thursday, Curtis #NHL #SJ"|{-0.0701668188,-0.012231146,0.1304316372,.. }|
804
+
805
+
806
+
##Step 2: Indexing your embeddings using different algorithms
807
+
After you've created embeddings for your data, you need to index them using one or more indexing algorithms. There are several different types of indexing algorithms available, including B-trees, k-nearest neighbors (KNN), and approximate nearest neighbors (ANN). The specific type of indexing algorithm you choose will depend on your use case and performance requirements. For example, B-trees are a good choice for range queries, while KNN and ANN algorithms are more efficient for similarity searches.
808
+
809
+
On small datasets (<100k rows), a linear search that compares every row to the query will give sub-second results, which may be fast enough for your use case. For larger datasets, you may want to consider various indexing strategies offered by additional extensions.
810
+
811
+
- <ahref="https://www.postgresql.org/docs/current/cube.html"target="_blank">Cube</a> is a built-in extension that provides a fast indexing strategy for finding similar vectors. By default it has an arbitrary limit of 100 dimensions, unless Postgres is compiled with a larger size.
812
+
- <ahref="https://github.com/pgvector/pgvector"target="_blank">PgVector</a> supports embeddings up to 2000 dimensions out of the box, and provides a fast indexing strategy for finding similar vectors.
813
+
814
+
When indexing your embeddings, it's important to consider the trade-offs between accuracy and speed. Exact indexing algorithms like B-trees can provide precise results, but may not be as fast as approximate indexing algorithms like KNN and ANN. Similarly, some indexing algorithms may require more memory or disk space than others.
815
+
816
+
In the following, we are creating an index on the tweet_embeddings table using the ivfflat algorithm for indexing. The ivfflat algorithm is a type of hybrid index that combines an Inverted File (IVF) index with a Flat (FLAT) index.
817
+
818
+
The index is being created on the embedding column in the tweet_embeddings table, which contains vector embeddings generated from the original tweet dataset. The`vector_cosine_ops` argument specifies the indexing operation to use for the embeddings. In this case, it's using the`cosine similarity` operation, which is a common method for measuring similarity between vectors.
819
+
820
+
By creating an index on the embedding column, the database can quickly search for and retrieve records that are similar to a given query vector. This can be useful for a variety of machine learning applications, such as similarity search or recommendation systems.
821
+
822
+
```sql
823
+
CREATEINDEXON tweet_embeddings USING ivfflat (embedding vector_cosine_ops);
824
+
```
825
+
##Step 3: Querying the index using embeddings for your queries
826
+
Once your embeddings have been indexed, you can use them to perform queries against your database. To do this, you'll need to provide a query embedding that represents the query you want to perform. The index will then return the closest matching embeddings from your database, based on the similarity between the query embedding and the stored embeddings.
827
+
828
+
```sql
829
+
WITH queryAS (
830
+
SELECTpgml.embed('distilbert-base-uncased','Star Wars christmas special is on Disney')::vectorAS embedding