Added blog post semantic search in postgres in 15 minutes #1535

Merged
Changes from 1 commit
21 commits
18f8f44 Preliminary draft of semantic search in postgres in 15 minutes (SilasMarvin, Jun 11, 2024)
00bd75d Cleanups (SilasMarvin, Jun 12, 2024)
068af92 Ready for review (SilasMarvin, Jun 14, 2024)
a9148da Cleanup first paragraph (SilasMarvin, Jun 17, 2024)
3e0fa33 A few suggestions (#1536) (levkk, Jun 17, 2024)
c71fcd2 Add reason on why to use semantic search (SilasMarvin, Jun 17, 2024)
9b6e75f Clean up spelling errors (SilasMarvin, Jun 17, 2024)
b451c9b Fix more small spelling errors (SilasMarvin, Jun 17, 2024)
d418deb Finish timings (SilasMarvin, Jun 18, 2024)
84872ac Update pgml-cms/blog/semantic-search-in-postgres-in-15-minutes.md (SilasMarvin, Jun 18, 2024)
1686f93 Update pgml-cms/blog/semantic-search-in-postgres-in-15-minutes.md (SilasMarvin, Jun 18, 2024)
b2b9d88 Update pgml-cms/blog/semantic-search-in-postgres-in-15-minutes.md (SilasMarvin, Jun 18, 2024)
b8766bd Update pgml-cms/blog/semantic-search-in-postgres-in-15-minutes.md (SilasMarvin, Jun 18, 2024)
4574183 Update pgml-cms/blog/semantic-search-in-postgres-in-15-minutes.md (SilasMarvin, Jun 18, 2024)
4db2149 Update pgml-cms/blog/semantic-search-in-postgres-in-15-minutes.md (SilasMarvin, Jun 18, 2024)
68368e2 Update pgml-cms/blog/semantic-search-in-postgres-in-15-minutes.md (SilasMarvin, Jun 18, 2024)
af8dd3e Convert italics back to backticks (SilasMarvin, Jun 18, 2024)
2c156ae Remove hnsw link out (SilasMarvin, Jun 18, 2024)
faf0be1 Alude to arrays (SilasMarvin, Jun 18, 2024)
27445f5 Finalize post (SilasMarvin, Jun 18, 2024)
427f77f Merge branch 'master' into silas-semantic-search-in-postgres-in-15-mi… (SilasMarvin, Jun 18, 2024)
Clean up spelling errors
SilasMarvin committed Jun 17, 2024
commit 9b6e75fddad0cc8fa329c25aee304e5e60caf6ee
20 changes: 10 additions & 10 deletions
pgml-cms/blog/semantic-search-in-postgres-in-15-minutes.md
@@ -106,7 +106,7 @@ For instance let’s say that we have the following documents:
| Document ID | Document text |
-----|----------|
| 1 | The pgml.transform function is a PostgreSQL function for calling LLMs in the database. |
-| 2 | I think tomatos are incredible on burgers. |
+| 2 | I think tomatoes are incredible on burgers. |


and a user is looking for the answer to the question: "What is the pgml.transform function?". If we embed the search query and all of the documents using a model like _mixedbread-ai/mxbai-embed-large-v1_, we can compare the query embedding to all of the document embeddings, and select the document that has the closest embedding in vector space, and therefore in meaning, to the to the answer.
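
To make the comparison step in the paragraph above concrete, here is a minimal Python sketch, not taken from the blog post, of picking the document whose embedding is closest to the query by cosine distance. The three-dimensional vectors and the names `cosine_distance`, `closest_document`, `toy_documents`, and `toy_query` are hypothetical stand-ins; real embeddings from a model such as mixedbread-ai/mxbai-embed-large-v1 have far more dimensions and would be produced by the model, not written by hand.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def closest_document(query_embedding, document_embeddings):
    """Return the ID of the document with the smallest cosine distance to the query."""
    return min(
        document_embeddings,
        key=lambda doc_id: cosine_distance(query_embedding, document_embeddings[doc_id]),
    )


# Hypothetical toy embeddings (assumption: hand-written, not model output).
toy_documents = {
    1: [0.9, 0.1, 0.0],  # stands in for the pgml.transform document
    2: [0.0, 0.2, 0.9],  # stands in for the burger document
}
toy_query = [1.0, 0.0, 0.1]  # stands in for the embedded search query

print(closest_document(toy_query, toy_documents))  # prints 1
```

In this sketch, the same `cosine_distance` function applied to `[1, 2, 3]` and `[2, 3, 4]` gives roughly 0.0074. It is only meant to illustrate the arithmetic; it makes no claim about how pgvector implements its distance operator.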
@@ -130,7 +130,7 @@ This is a somewhat confusing formula but luckily _pgvector_ provides an operato

!!! generic

-!!! code_block
+!!! code_block time="64.643 ms"

```postgresql
SELECT '[1,2,3]'::vector <=> '[2,3,4]'::vector;
@@ -176,7 +176,7 @@ SELECT pgml.embed(
<=>
pgml.embed(
'mixedbread-ai/mxbai-embed-large-v1',
-'I think tomatos are incredible on burgers.'
+'I think tomatoes are incredible on burgers.'
)::vector AS cosine_distance;
```

@@ -191,14 +191,14 @@ cosine_distance

cosine_distance
--------------------
-0.7383001059221699
+0.7328613577628744
```

!!!

!!!

-You'll notice that the distance between "What is the pgml.transform function?" and "The pgml.transform function is a PostgreSQL function for calling LLMs in the database." is much smaller than the cosine distance between "What is the pgml.transform function?" and "I think tomatos are incredible on burgers".
+You'll notice that the distance between "What is the pgml.transform function?" and "The pgml.transform function is a PostgreSQL function for calling LLMs in the database." is much smaller than the cosine distance between "What is the pgml.transform function?" and "I think tomatoes are incredible on burgers".
Contributor:

It's probably worth mentioning "out of sample" tokens, since they are rife on domain specific topics like this.

Contributor, Author:

I'm not sure what that is?


## Making it fast!

@@ -228,10 +228,10 @@ VALUES
),

(
-'I think tomatos are incredible on burgers.',
+'I think tomatoes are incredible on burgers.',
pgml.embed(
'mixedbread-ai/mxbai-embed-large-v1',
-'I think tomatos are incredible on burgers.'
+'I think tomatoes are incredible on burgers.'
)
);
```
@@ -282,7 +282,7 @@ LIMIT 1;

!!!

-This query is fast for now, but as we add more data to the the table, it will slow down because we have not indexed the embedding column.
+This query is fast for now, but as we add more data to the table, it will slow down because we have not indexed the embedding column.

Let's demonstrate this by inserting 100,000 additional embeddings:

@@ -344,7 +344,7 @@ LIMIT 1;

This somewhat less than ideal performance can be fixed by indexing the embedding column. There are two types of indexes available in _pgvector_: IVFFlat and HNSW.

-IVFFlat indexes clusters the table into sublists, and when searching, only searches over a fixed number of sublists. In oour example, if we were to add an IVFFlat index with 10 lists:
+IVFFlat indexes clusters the table into sublists, and when searching, only searches over a fixed number of sublists. In our example, if we were to add an IVFFlat index with 10 lists:

!!! generic

@@ -360,7 +360,7 @@ WITH (lists = 10);

!!!

-and search again, we would get much better perfomance:
+and search again, we would get much better performance:

!!! generic
