Nearest neighbor search for Rails
Supports:
- Postgres (cube and pgvector)
- MariaDB 11.8
- MySQL 9 (searching requires HeatWave) - experimental
- SQLite (sqlite-vec) - experimental
Also available for Redis and S3 Vectors
Add this line to your application’s Gemfile:
gem"neighbor"
For Postgres, Neighbor supports two extensions: cube and pgvector. cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.
For cube, run:
```sh
rails generate neighbor:cube
rails db:migrate
```
For pgvector, install the extension and run:

```sh
rails generate neighbor:vector
rails db:migrate
```
For SQLite, add this line to your application’s Gemfile:
gem"sqlite-vec"
And run:
```sh
rails generate neighbor:sqlite
```
Create a migration
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    # cube
    add_column :items, :embedding, :cube

    # pgvector, MariaDB, and MySQL
    add_column :items, :embedding, :vector, limit: 3 # dimensions

    # sqlite-vec
    add_column :items, :embedding, :binary
  end
end
```
Add to your model
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding
end
```
Update the vectors
```ruby
item.update(embedding: [1.0, 1.2, 0.5])
```
Get the nearest neighbors to a record
```ruby
item.nearest_neighbors(:embedding, distance: "euclidean").first(5)
```
Get the nearest neighbors to a vector
```ruby
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)
```
Records returned from nearest_neighbors will have a neighbor_distance attribute

```ruby
nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
nearest_item.neighbor_distance
```
See the additional docs below for each database, or check out some examples.
For cube, supported distance values are:

- euclidean
- cosine
- taxicab
- chebyshev
For cosine distance with cube, vectors must be normalized before being stored.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, normalize: true
end
```
For inner product with cube, see this example.

The cube type can have up to 100 dimensions by default. See the Postgres docs for how to increase this.
For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end
```
For pgvector, supported distance values are:

- euclidean
- inner_product
- cosine
- taxicab
- hamming
- jaccard
The vector type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
The halfvec type can have up to 16,000 dimensions, and half vectors with up to 4,000 dimensions can be indexed.
The bit type can have up to 83 million dimensions, and bit vectors with up to 64,000 dimensions can be indexed.
The sparsevec type can have up to 16,000 non-zero elements, and sparse vectors with up to 1,000 non-zero elements can be indexed.
Add an approximate index to speed up queries. Create a migration with:
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.1]
  def change
    add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
    # or
    add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
  end
end
```
Use :vector_cosine_ops for cosine distance and :vector_ip_ops for inner product.
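For example, an HNSW index for cosine distance only swaps the operator class (a sketch following the migration above; the migration class name is illustrative):

```ruby
class AddCosineIndexToItemsEmbedding < ActiveRecord::Migration[8.1]
  def change
    # same index shape as above, with the cosine operator class
    add_index :items, :embedding, using: :hnsw, opclass: :vector_cosine_ops
  end
end
```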
Set the size of the dynamic candidate list with HNSW
```ruby
Item.connection.execute("SET hnsw.ef_search = 100")
```
Or the number of probes with IVFFlat
```ruby
Item.connection.execute("SET ivfflat.probes = 3")
```
Use the halfvec type to store half-precision vectors

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    add_column :items, :embedding, :halfvec, limit: 3 # dimensions
  end
end
```
Index vectors at half precision for smaller indexes
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.1]
  def change
    add_index :items, "(embedding::halfvec(3)) halfvec_l2_ops", using: :hnsw
  end
end
```
Get the nearest neighbors
```ruby
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
```
Use the bit type to store binary vectors

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    add_column :items, :embedding, :bit, limit: 3 # dimensions
  end
end
```
Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
```
Use expression indexing for binary quantization
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.1]
  def change
    add_index :items, "(binary_quantize(embedding)::bit(3)) bit_hamming_ops", using: :hnsw
  end
end
```
Use the sparsevec type to store sparse vectors

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    add_column :items, :embedding, :sparsevec, limit: 3 # dimensions
  end
end
```
Get the nearest neighbors
```ruby
embedding = Neighbor::SparseVector.new({0 => 0.9, 1 => 1.3, 2 => 1.1}, 3)
Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
```
For MariaDB, supported distance values are:

- euclidean
- cosine
- hamming
Vector columns must use null: false to add a vector index

```ruby
class CreateItems < ActiveRecord::Migration[8.1]
  def change
    create_table :items do |t|
      t.vector :embedding, limit: 3, null: false
      t.index :embedding, type: :vector
    end
  end
end
```
Use the bigint type to store binary vectors

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    add_column :items, :embedding, :bigint
  end
end
```
Note: Binary vectors can have up to 64 dimensions
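A binary vector is written as its integer value (a minimal sketch, assuming the Item model above; the example bits are arbitrary):

```ruby
# store the 3-bit vector 101 as the integer 5
item.update(embedding: 0b101)
```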
Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, 5, distance: "hamming").first(5)
```
For MySQL, supported distance values are:

- euclidean
- cosine
- hamming

Note: The DISTANCE() function is only available on HeatWave
Use the binary type to store binary vectors

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    add_column :items, :embedding, :binary
  end
end
```
Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
```
For sqlite-vec, supported distance values are:

- euclidean
- cosine
- taxicab
- hamming
For sqlite-vec, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end
```
You can also use virtual tables

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    # Rails 8+
    create_virtual_table :items, :vec0, [
      "id integer PRIMARY KEY AUTOINCREMENT NOT NULL",
      "embedding float[3] distance_metric=L2"
    ]

    # Rails < 8
    execute <<~SQL
      CREATE VIRTUAL TABLE items USING vec0(
        id integer PRIMARY KEY AUTOINCREMENT NOT NULL,
        embedding float[3] distance_metric=L2
      )
    SQL
  end
end
```
Use distance_metric=cosine for cosine distance
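For example, the Rails 8+ table above with cosine distance only changes the distance_metric (a sketch under that assumption):

```ruby
# same vec0 virtual table as above, using cosine distance instead of L2
create_virtual_table :items, :vec0, [
  "id integer PRIMARY KEY AUTOINCREMENT NOT NULL",
  "embedding float[3] distance_metric=cosine"
]
```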
You can optionally ignore any shadow tables that are created
```ruby
ActiveRecord::SchemaDumper.ignore_tables += ["items_chunks", "items_rowids", "items_vector_chunks00"]
```
Get the k nearest neighbors

```ruby
Item.where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)
```
Filter by primary key
```ruby
Item.where(id: [2, 3]).where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)
```
Use the type option for int8 vectors

```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3, type: :int8
end
```
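Int8 vectors are then stored and queried like float vectors (a sketch; the sample values are assumptions):

```ruby
# integer elements instead of floats
item.update(embedding: [1, 2, 3])
Item.nearest_neighbors(:embedding, [1, 2, 3], distance: "euclidean").first(5)
```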
Use the type option for binary vectors

```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 8, type: :bit
end
```
Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
```
Examples:

- Embeddings with OpenAI
- Binary embeddings with Cohere
- Sentence embeddings with Informers
- Hybrid search with Informers
- Sparse search with Transformers.rb
- Recommendations with Disco
Generate a model
```sh
rails generate model Document content:text embedding:vector{1536}
rails db:migrate
```

And add has_neighbors

```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```
Create a method to call the embeddings API

```ruby
def embed(input)
  url = "https://api.openai.com/v1/embeddings"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("OPENAI_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    input: input,
    model: "text-embedding-3-small"
  }
  response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
  JSON.parse(response.body)["data"].map { |v| v["embedding"] }
end
```
Pass your input
```ruby
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = embed(input)
```
Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```
And get similar documents
```ruby
document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
```

See the complete code
Generate a model
```sh
rails generate model Document content:text embedding:bit{1536}
rails db:migrate
```

And add has_neighbors

```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```
Create a method to call the embed API

```ruby
def embed(input, input_type)
  url = "https://api.cohere.com/v2/embed"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("CO_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    texts: input,
    model: "embed-v4.0",
    input_type: input_type,
    embedding_types: ["ubinary"]
  }
  response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
  JSON.parse(response.body)["embeddings"]["ubinary"].map { |e| e.map { |v| v.chr.unpack1("B*") }.join }
end
```
Pass your input
```ruby
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = embed(input, "search_document")
```
Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```
Embed the search query
query="forest"query_embedding=embed([query],"search_query")[0]
And search the documents
```ruby
Document.nearest_neighbors(:embedding, query_embedding, distance: "hamming").first(5).map(&:content)
```

See the complete code

You can generate embeddings locally with Informers.
Generate a model
```sh
rails generate model Document content:text embedding:vector{384}
rails db:migrate
```

And add has_neighbors

```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```
Load a model

```ruby
model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")
```
Pass your input
```ruby
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = model.(input)
```
Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```
And get similar documents
```ruby
document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
```

See the complete code

You can use Neighbor for hybrid search with Informers.
Generate a model
```sh
rails generate model Document content:text embedding:vector{768}
rails db:migrate
```

And add has_neighbors and a scope for keyword search

```ruby
class Document < ApplicationRecord
  has_neighbors :embedding

  scope :search, ->(query) {
    where("to_tsvector(content) @@ plainto_tsquery(?)", query)
      .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
  }
end
```
Create some documents
```ruby
Document.create!(content: "The dog is barking")
Document.create!(content: "The cat is purring")
Document.create!(content: "The bear is growling")
```
Generate an embedding for each document
```ruby
embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model

Document.find_each do |document|
  embedding = embed.(document.content, **embed_options)
  document.update!(embedding: embedding)
end
```
Perform keyword search
query="growling bear"keyword_results=Document.search(query).limit(20).load_async
And semantic search in parallel (the query prefix is specific to theembedding model)
query_prefix="Represent this sentence for searching relevant passages: "query_embedding=embed.(query_prefix +query, **embed_options)semantic_results=Document.nearest_neighbors(:embedding,query_embedding,distance:"cosine").limit(20).load_async
To combine the results, use Reciprocal Rank Fusion (RRF)
```ruby
Neighbor::Reranking.rrf(keyword_results, semantic_results).first(5)
```
Or a reranking model
```ruby
rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
results = (keyword_results + semantic_results).uniq
rerank.(query, results.map(&:content)).first(5).map { |v| results[v[:doc_id]] }
```

See the complete code

You can generate sparse embeddings locally with Transformers.rb.
Generate a model
```sh
rails generate model Document content:text embedding:sparsevec{30522}
rails db:migrate
```

And add has_neighbors

```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```
Load a model to generate embeddings

```ruby
class EmbeddingModel
  def initialize(model_id)
    @model = Transformers::AutoModelForMaskedLM.from_pretrained(model_id)
    @tokenizer = Transformers::AutoTokenizer.from_pretrained(model_id)
    @special_token_ids = @tokenizer.special_tokens_map.map { |_, token| @tokenizer.vocab[token] }
  end

  def embed(input)
    feature = @tokenizer.(input, padding: true, truncation: true, return_tensors: "pt", return_token_type_ids: false)
    output = @model.(**feature)[0]
    values = Torch.max(output * feature[:attention_mask].unsqueeze(-1), dim: 1)[0]
    values = Torch.log(1 + Torch.relu(values))
    values[0.., @special_token_ids] = 0
    values.to_a
  end
end

model = EmbeddingModel.new("opensearch-project/opensearch-neural-sparse-encoding-v1")
```
Pass your input
```ruby
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = model.embed(input)
```
Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: Neighbor::SparseVector.new(embedding)}
end
Document.insert_all!(documents)
```
Embed the search query
query="forest"query_embedding=model.embed([query])[0]
And search the documents
```ruby
Document.nearest_neighbors(:embedding, Neighbor::SparseVector.new(query_embedding), distance: "inner_product").first(5).map(&:content)
```

See the complete code

You can use Neighbor for online item-based recommendations with Disco. We’ll use MovieLens data for this example.
Generate a model
```sh
rails generate model Movie name:string factors:cube
rails db:migrate
```

And add has_neighbors

```ruby
class Movie < ApplicationRecord
  has_neighbors :factors, dimensions: 20, normalize: true
end
```
Fit the recommender
```ruby
data = Disco.load_movielens
recommender = Disco::Recommender.new(factors: 20)
recommender.fit(data)
```
Store the item factors
```ruby
movies = []
recommender.item_ids.each do |item_id|
  movies << {name: item_id, factors: recommender.item_factors(item_id)}
end
Movie.create!(movies)
```
And get similar movies
```ruby
movie = Movie.find_by(name: "Star Wars (1977)")
movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
```

See the complete code for cube and pgvector

View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
```sh
git clone https://github.com/ankane/neighbor.git
cd neighbor
bundle install

# Postgres
createdb neighbor_test
bundle exec rake test:postgresql

# SQLite
bundle exec rake test:sqlite

# MariaDB
docker run -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 -e MARIADB_DATABASE=neighbor_test -p 3307:3306 mariadb:11.8
bundle exec rake test:mariadb

# MySQL
docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=neighbor_test -p 3306:3306 mysql:9
bundle exec rake test:mysql
```