diff --git a/pgml-cms/blog/semantic-search-in-postgres-in-15-minutes.md b/pgml-cms/blog/semantic-search-in-postgres-in-15-minutes.md
@@ -106,7 +106,7 @@ For instance let’s say that we have the following documents:
 | Document ID | Document text |
 -----|----------|
 | 1 | The pgml.transform function is a PostgreSQL function for calling LLMs in the database. |
-| 2 | I think tomatos are incredible on burgers. |
+| 2 | I think tomatoes are incredible on burgers. |
 
 
 and a user is looking for the answer to the question: "What is the pgml.transform function?". If we embed the search query and all of the documents using a model like _mixedbread-ai/mxbai-embed-large-v1_, we can compare the query embedding to all of the document embeddings, and select the document that has the closest embedding in vector space, and therefore in meaning, to the to the answer.
@@ -130,7 +130,7 @@ This is a somewhat confusing formula but luckily _pgvector_ provides an operato
 
 !!! generic
 
-!!! code_block
+!!! code_block time="64.643 ms"
 
 ```postgresql
 SELECT '[1,2,3]'::vector <=> '[2,3,4]'::vector;
@@ -176,7 +176,7 @@ SELECT pgml.embed(
  <=>
 pgml.embed(
  'mixedbread-ai/mxbai-embed-large-v1',
- 'I think tomatos are incredible on burgers.'
+ 'I think tomatoes are incredible on burgers.'
 )::vector AS cosine_distance;
 ```
 
@@ -191,14 +191,14 @@ cosine_distance
 
 cosine_distance
 --------------------
-0.7383001059221699
+0.7328613577628744
 ```
 
 !!!
 
 !!!
 
-You'll notice that the distance between "What is the pgml.transform function?" and "The pgml.transform function is a PostgreSQL function for calling LLMs in the database." is much smaller than the cosine distance between "What is the pgml.transform function?" and "I think tomatos are incredible on burgers".
+You'll notice that the distance between "What is the pgml.transform function?" and "The pgml.transform function is a PostgreSQL function for calling LLMs in the database." is much smaller than the cosine distance between "What is the pgml.transform function?" and "I think tomatoes are incredible on burgers".
montanalow (Contributor), Jun 18, 2024: It's probably worth mentioning "out of sample" tokens, since they are rife on domain specific topics like this.
SilasMarvin (Contributor, Author), Jun 18, 2024: I'm not sure what that is?
 
 ## Making it fast!
 
@@ -228,10 +228,10 @@ VALUES
  ),
 
  (
-   'I think tomatos are incredible on burgers.',
+   'I think tomatoes are incredible on burgers.',
    pgml.embed(
      'mixedbread-ai/mxbai-embed-large-v1',
-     'I think tomatos are incredible on burgers.'
+     'I think tomatoes are incredible on burgers.'
    )
  );
 ```
@@ -282,7 +282,7 @@ LIMIT 1;
 
 !!!
 
-This query is fast for now, but as we add more data to the the table, it will slow down because we have not indexed the embedding column.
+This query is fast for now, but as we add more data to the table, it will slow down because we have not indexed the embedding column.
 
 Let's demonstrate this by inserting 100,000 additional embeddings:
 
@@ -344,7 +344,7 @@ LIMIT 1;
 
 This somewhat less than ideal performance can be fixed by indexing the embedding column. There are two types of indexes available in _pgvector_: IVFFlat and HNSW.
SilasMarvin marked this conversation as resolved.
 
-IVFFlat indexes clusters the table into sublists, and when searching, only searches over a fixed number of sublists. In oour example, if we were to add an IVFFlat index with 10 lists:
+IVFFlat indexes clusters the table into sublists, and when searching, only searches over a fixed number of sublists. In our example, if we were to add an IVFFlat index with 10 lists:
 
 !!! generic
 
@@ -360,7 +360,7 @@ WITH (lists = 10);
 
 !!!
 
-and search again, we would get much better perfomance:
+and search again, we would get much better performance:
 
 !!! generic
 
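Reading aid for the diff above: the `<=>` operator in the `@@ -130,7 +130,7 @@` hunk is _pgvector_'s cosine distance operator, and the `cosine_distance` column in the later hunks is its result. Below is a minimal Python sketch of that computation, assuming the standard definition (one minus cosine similarity). The function name `cosine_distance` and the choice to raise an error for zero-length vectors are illustrative assumptions, not part of _pgvector_ or `pgml`.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length numeric vectors.

    Hypothetical helper for illustration only; it is not a pgvector or pgml API.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    # Assumption of this sketch: treat zero vectors as an error, because the
    # cosine of the angle is undefined when either vector has no direction.
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")

    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Same two vectors as the SELECT statement in the diff.
    print(cosine_distance([1, 2, 3], [2, 3, 4]))
```

Vectors pointing in nearly the same direction give a distance close to 0 and unrelated directions give a distance close to 1, which is why the matching document in the post scores lower than the burger sentence.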