|
| 1 | +--- |
| 2 | +author: Silas Marvin |
| 3 | +description: HNSW indexing is the latest upgrade in vector recall performance. In this post we announce our updated SDK that utilizes HNSW indexing to give world-class performance in vector search. |
| 4 | +image: https://postgresml.org/dashboard/static/images/blog/announcing_hnsw_support.webp |
| 5 | +image_alt: HNSW provides a significant improvement in recall speed compared to IVFFlat |
| 6 | +--- |
| 7 | + |
| 8 | +# Announcing HNSW Support in Our SDK |
| 9 | + |
| 10 | +<div class="d-flex align-items-center mb-4"> |
| 11 | + <img width="54px" height="54px" src="/dashboard/static/images/team/silas.jpg" style="border-radius:50%;" alt="Author" /> |
| 12 | + <div class="ps-3 d-flex justify-content-center flex-column"> |
| 13 | +<p class="m-0">Silas Marvin</p> |
| 14 | +<p class="m-0">September 21, 2023</p> |
| 15 | + </div> |
| 16 | +</div> |
| 17 | + |
| 18 | +PostgresML makes it easy to use machine learning with your database and to scale workloads horizontally in our cloud. Our SDK makes it even easier. |
| 19 | + |
| 20 | +<img src="/dashboard/static/images/blog/announcing_hnsw_support.webp" alt="data is always the best medicine" /> |
| 21 | +<center><p><i>HNSW (hierarchical navigable small worlds) is an indexing method that greatly improves vector recall</i></p></center> |
| 22 | + |
| 23 | +## Introducing HNSW |
| 24 | + |
| 25 | +Under the hood, our SDK utilizes [pgvector](https://github.com/pgvector/pgvector) to store, index, and recall vectors. Up until this point, our SDK used IVFFlat indexing to divide vectors into lists, search a subset of those lists, and return the closest vector matches. |
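
As a rough illustration of the list-based approach, the sketch below builds an IVFFlat index with pgvector and limits how many lists are searched per query. The `ivfflat_demo` table and its rows are hypothetical and exist only for this example. `lists` and `ivfflat.probes` are pgvector's own settings, and the values shown are placeholders for a toy table, not recommendations.

```postgresql
-- Hypothetical demo table; not part of the SDK or the dataset used in this post.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE ivfflat_demo (
    id bigserial PRIMARY KEY,
    embedding vector(3)
);

-- IVFFlat derives its lists from the rows already in the table
-- (the training step), so load data before building the index.
INSERT INTO ivfflat_demo (embedding)
VALUES ('[1, 0, 0]'), ('[0, 1, 0]'), ('[0, 0, 1]'), ('[1, 1, 0]');

-- Divide the vectors into 2 lists (placeholder value for a tiny table).
CREATE INDEX ON ivfflat_demo
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 2);

-- Search only 1 list per query: faster, but close matches in other lists are missed.
SET ivfflat.probes = 1;

SELECT id, 1 - (embedding <=> '[1, 0, 0]') AS cosine_similarity
FROM ivfflat_demo
ORDER BY embedding <=> '[1, 0, 0]'
LIMIT 2;
```

Raising `ivfflat.probes` searches more lists, which generally improves recall at the cost of speed.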
| 26 | + |
| 27 | +While the IVFFlat indexing method is fast, it is not as fast as HNSW. Thanks to the latest update of [pgvector](https://github.com/pgvector/pgvector), our SDK now utilizes HNSW indexing, creating multi-layer graphs instead of lists and removing the required training step IVFFlat imposed. |
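
For comparison, here is a minimal sketch of an HNSW index in pgvector (0.5.0 or later). The `hnsw_demo` table is hypothetical. `m` and `ef_construction` are pgvector's build-time options, shown here with what we understand to be their default values. Because there is no training step, the index can be created before any rows exist.

```postgresql
-- Hypothetical demo table; not part of the SDK or the dataset used in this post.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE hnsw_demo (
    id bigserial PRIMARY KEY,
    embedding vector(3)
);

-- No training step: the graph can be built on an empty table and grows as rows arrive.
-- m: maximum connections per node in each layer.
-- ef_construction: size of the candidate list used while building the graph.
CREATE INDEX ON hnsw_demo
USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

INSERT INTO hnsw_demo (embedding)
VALUES ('[1, 0, 0]'), ('[0, 1, 0]'), ('[0, 0, 1]'), ('[1, 1, 0]');

SELECT id, 1 - (embedding <=> '[1, 0, 0]') AS cosine_similarity
FROM hnsw_demo
ORDER BY embedding <=> '[1, 0, 0]'
LIMIT 2;
```

Larger values of `m` and `ef_construction` tend to improve recall but make the index slower to build.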
| 28 | + |
| 29 | +The results are not disappointing. |
| 30 | + |
| 31 | +## Comparing HNSW and IVFFlat |
| 32 | + |
| 33 | +In one of our previous posts, [Tuning vector recall while generating query embeddings in the database](/blog/tuning-vector-recall-while-generating-query-embeddings-in-the-database), we worked with a dataset of over 5 million Amazon movie reviews and, after embedding the reviews, performed a semantic similarity search to retrieve the 5 closest reviews. |
| 34 | + |
| 35 | +Let's run that query again: |
| 36 | + |
| 37 | +!!! generic |
| 38 | + |
| 39 | +!!! code_block time="89.118 ms" |
| 40 | + |
| 41 | +```postgresql |
| 42 | +WITH request AS ( |
| 43 | + SELECT pgml.embed( |
| 44 | + 'intfloat/e5-large', |
| 45 | + 'query: Best 1980''s scifi movie' |
| 46 | + )::vector(1024) AS embedding |
| 47 | +) |
| 48 | + |
| 49 | +SELECT |
| 50 | + review_body, product_title, star_rating, total_votes, |
| 51 | + 1 - ( |
| 52 | + review_embedding_e5_large <=> ( |
| 53 | + SELECT embedding FROM request |
| 54 | + ) |
| 55 | + ) AS cosine_similarity |
| 56 | +FROM pgml.amazon_us_reviews |
| 57 | +ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request) |
| 58 | +LIMIT 5; |
| 59 | +``` |
| 60 | + |
| 61 | +!!! |
| 62 | + |
| 63 | +!!! results |
| 64 | + |
| 65 | +| review_body | product_title | star_rating | total_votes | cosine_similarity |
| 66 | +| -------------------------------------------------| -------------------------------------------------------------| -------------| -----------| ------------------| |
| 67 | +| best 80s SciFi movie ever| The Adventures of Buckaroo Banzai Across the Eighth Dimension| 5| 1| 0.9495371273162286| |
| 68 | +| the best of 80s sci fi horror!| The Blob| 5| 2| 0.9097434758143605| |
| 69 | +| Three of the best sci-fi movies of the seventies| Sci-Fi: Triple Feature (BD)[Blu-ray]| 5| 0| 0.9008723412875651| |
| 70 | +| best sci fi movie ever| The Day the Earth Stood Still (Special Edition)[Blu-ray]| 5| 2| 0.8943620968858654| |
| 71 | +| Great Science Fiction movie| Bloodsport / Timecop (Action Double Feature)[Blu-ray]| 5| 0| 0.894282454374093| |
| 72 | + |
| 73 | +!!! |
| 74 | + |
| 75 | +!!! |
| 76 | + |
| 77 | +This query utilized IVFFlat indexing and queried through over 5 million rows in 89.118ms. Pretty fast! |
| 78 | + |
| 79 | +Let's drop our IVFFlat index and create an HNSW index. |
| 80 | + |
| 81 | +!!! generic |
| 82 | + |
| 83 | +!!! code_block time="10255099.233 ms (02:50:55.099)" |
| 84 | + |
| 85 | +```postgresql |
| 86 | +DROP INDEX index_amazon_us_reviews_on_review_embedding_e5_large; |
| 87 | +CREATE INDEX CONCURRENTLY ON pgml.amazon_us_reviews USING hnsw (review_embedding_e5_large vector_cosine_ops); |
| 88 | +``` |
| 89 | + |
| 90 | +!!! |
| 91 | + |
| 92 | +!!! results |
| 93 | + |
| 94 | +|CREATE INDEX| |
| 95 | +|------------| |
| 96 | + |
| 97 | +!!! |
| 98 | + |
| 99 | +!!! |
| 100 | + |
| 101 | +Now let's try the query again utilizing the new HNSW index we created. |
| 102 | + |
| 103 | +!!! generic |
| 104 | + |
| 105 | +!!! code_block time="17.465 ms" |
| 106 | + |
| 107 | +```postgresql |
| 108 | +WITH request AS ( |
| 109 | + SELECT pgml.embed( |
| 110 | + 'intfloat/e5-large', |
| 111 | + 'query: Best 1980''s scifi movie' |
| 112 | + )::vector(1024) AS embedding |
| 113 | +) |
| 114 | + |
| 115 | +SELECT |
| 116 | + review_body, product_title, star_rating, total_votes, |
| 117 | + 1 - ( |
| 118 | + review_embedding_e5_large <=> ( |
| 119 | + SELECT embedding FROM request |
| 120 | + ) |
| 121 | + ) AS cosine_similarity |
| 122 | +FROM pgml.amazon_us_reviews |
| 123 | +ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request) |
| 124 | +LIMIT 5; |
| 125 | +``` |
| 126 | + |
| 127 | +!!! |
| 128 | + |
| 129 | +!!! results |
| 130 | + |
| 131 | +| review_body | product_title | star_rating | total_votes | cosine_similarity |
| 132 | +| ---------------------------------| -------------------------------------------------------------| -------------| -----------| ------------------| |
| 133 | +| best 80s SciFi movie ever| The Adventures of Buckaroo Banzai Across the Eighth Dimension| 5| 1| 0.9495371273162286| |
| 134 | +| the best of 80s sci fi horror!| The Blob| 5| 2| 0.9097434758143605| |
| 135 | +| One of the Better 80's Sci-Fi| Krull (Special Edition)| 3| 5| 0.9093884940741694| |
| 136 | +| Good 1980s movie| Can't Buy Me Love| 4| 0| 0.9090294438721961| |
| 137 | +| great 80's movie| How I Got Into College| 5| 0| 0.9016508795301296| |
| 138 | + |
| 139 | +!!! |
| 140 | + |
| 141 | +!!! |
| 142 | + |
| 143 | +Not only are the results better (the `cosine_similarity` is higher overall), but HNSW is over 5x faster, reducing our search and embedding time to 17.465ms. |
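
If you want to check that a query like this actually uses the HNSW index, or trade some speed for recall, a sketch along the following lines may help. It reuses the table and column names from the queries above. `hnsw.ef_search` is pgvector's query-time setting (default 40), and the value 100 is only an illustrative assumption, not something we benchmarked for this post.

```postgresql
BEGIN;

-- Widen the candidate list for this transaction only (pgvector's default is 40).
-- Higher values usually improve recall and slow the query down.
SET LOCAL hnsw.ef_search = 100;

-- EXPLAIN ANALYZE runs the query and prints the plan; when the HNSW index is used,
-- the plan should show an index scan on it rather than a sequential scan.
EXPLAIN ANALYZE
WITH request AS (
    SELECT pgml.embed(
        'intfloat/e5-large',
        'query: Best 1980''s scifi movie'
    )::vector(1024) AS embedding
)
SELECT
    review_body,
    1 - (
        review_embedding_e5_large <=> (SELECT embedding FROM request)
    ) AS cosine_similarity
FROM pgml.amazon_us_reviews
ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
LIMIT 5;

COMMIT;
```

Using `SET LOCAL` inside a transaction keeps the setting from leaking into other queries on the same connection.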
| 144 | + |
| 145 | +This is a massive upgrade to our SDK's recall speed and greatly improves overall performance. |
| 146 | + |
| 147 | +For a deeper dive into HNSW, check out [Jonathan Katz's excellent article on HNSW in pgvector](https://jkatz05.com/post/postgres/pgvector-hnsw-performance/). |