Nearest neighbor search for Rails
Supports:
- Postgres (cube and pgvector)
- MariaDB 11.8
- MySQL 9 (searching requires HeatWave) - experimental
- SQLite (sqlite-vec) - experimental
Also available for Redis and S3 Vectors
Add this line to your application’s Gemfile:
gem"neighbor"
For Postgres, Neighbor supports two extensions: cube and pgvector. cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.
For cube, run:
```sh
rails generate neighbor:cube
rails db:migrate
```
For pgvector, install the extension and run:

```sh
rails generate neighbor:vector
rails db:migrate
```
For SQLite, add this line to your application’s Gemfile:
gem"sqlite-vec"
And run:
```sh
rails generate neighbor:sqlite
```
Create a migration
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    # cube
    add_column :items, :embedding, :cube

    # pgvector, MariaDB, and MySQL
    add_column :items, :embedding, :vector, limit: 3 # dimensions

    # sqlite-vec
    add_column :items, :embedding, :binary
  end
end
```
Add to your model
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding
end
```
Update the vectors
```ruby
item.update(embedding: [1.0, 1.2, 0.5])
```
Get the nearest neighbors to a record
```ruby
item.nearest_neighbors(:embedding, distance: "euclidean").first(5)
```
Get the nearest neighbors to a vector
```ruby
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)
```
Records returned from nearest_neighbors will have a neighbor_distance attribute

```ruby
nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
nearest_item.neighbor_distance
```
See the additional docs below for each database, or check out some examples.
For cube, supported distance values are:

- euclidean
- cosine
- taxicab
- chebyshev
For cosine distance with cube, vectors must be normalized before being stored.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, normalize: true
end
```
For inner product with cube, see this example.

The cube type can have up to 100 dimensions by default. See the Postgres docs for how to increase this.
For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end
```
For pgvector, supported distance values are:

- euclidean
- inner_product
- cosine
- taxicab
- hamming
- jaccard
The vector type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
The halfvec type can have up to 16,000 dimensions, and half vectors with up to 4,000 dimensions can be indexed.
The bit type can have up to 83 million dimensions, and bit vectors with up to 64,000 dimensions can be indexed.
The sparsevec type can have up to 16,000 non-zero elements, and sparse vectors with up to 1,000 non-zero elements can be indexed.
Add an approximate index to speed up queries. Create a migration with:
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.1]
  def change
    add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
    # or
    add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
  end
end
```
Use :vector_cosine_ops for cosine distance and :vector_ip_ops for inner product.
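For example, an HNSW index for cosine distance only swaps the operator class (a sketch following the migration above; the migration class name is illustrative):

```ruby
class AddCosineIndexToItemsEmbedding < ActiveRecord::Migration[8.1]
  def change
    # same index shape as above, with the cosine operator class
    add_index :items, :embedding, using: :hnsw, opclass: :vector_cosine_ops
  end
end
```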
Set the size of the dynamic candidate list with HNSW
```ruby
Item.connection.execute("SET hnsw.ef_search = 100")
```
Or the number of probes with IVFFlat
```ruby
Item.connection.execute("SET ivfflat.probes = 3")
```
Use the halfvec type to store half-precision vectors

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    add_column :items, :embedding, :halfvec, limit: 3 # dimensions
  end
end
```
Index vectors at half precision for smaller indexes
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.1]
  def change
    add_index :items, "(embedding::halfvec(3)) halfvec_l2_ops", using: :hnsw
  end
end
```
Get the nearest neighbors
```ruby
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
```
Use the bit type to store binary vectors

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    add_column :items, :embedding, :bit, limit: 3 # dimensions
  end
end
```
Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
```
Use expression indexing for binary quantization
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.1]
  def change
    add_index :items, "(binary_quantize(embedding)::bit(3)) bit_hamming_ops", using: :hnsw
  end
end
```
Use the sparsevec type to store sparse vectors

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    add_column :items, :embedding, :sparsevec, limit: 3 # dimensions
  end
end
```
Get the nearest neighbors
```ruby
embedding = Neighbor::SparseVector.new({0 => 0.9, 1 => 1.3, 2 => 1.1}, 3)
Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
```
For MariaDB, supported distance values are:

- euclidean
- cosine
- hamming
Vector columns must use null: false to add a vector index

```ruby
class CreateItems < ActiveRecord::Migration[8.1]
  def change
    create_table :items do |t|
      t.vector :embedding, limit: 3, null: false
      t.index :embedding, type: :vector
    end
  end
end
```
Use the bigint type to store binary vectors

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    add_column :items, :embedding, :bigint
  end
end
```
Note: Binary vectors can have up to 64 dimensions
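A binary vector is written as its integer value (a minimal sketch, assuming the Item model above; the example bits are arbitrary):

```ruby
# store the 3-bit vector 101 as the integer 5
item.update(embedding: 0b101)
```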
Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, 5, distance: "hamming").first(5)
```
For MySQL, supported distance values are:

- euclidean
- cosine
- hamming

Note: The DISTANCE() function is only available on HeatWave
Use the binary type to store binary vectors

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    add_column :items, :embedding, :binary
  end
end
```
Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
```
For sqlite-vec, supported distance values are:

- euclidean
- cosine
- taxicab
- hamming
For sqlite-vec, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end
```
You can also use virtual tables

```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.1]
  def change
    # Rails 8+
    create_virtual_table :items, :vec0, [
      "id integer PRIMARY KEY AUTOINCREMENT NOT NULL",
      "embedding float[3] distance_metric=L2"
    ]

    # Rails < 8
    execute <<~SQL
      CREATE VIRTUAL TABLE items USING vec0(
        id integer PRIMARY KEY AUTOINCREMENT NOT NULL,
        embedding float[3] distance_metric=L2
      )
    SQL
  end
end
```
Use distance_metric=cosine for cosine distance
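For example, the Rails 8+ table above with cosine distance only changes the distance_metric (a sketch under that assumption):

```ruby
# same vec0 virtual table as above, using cosine distance instead of L2
create_virtual_table :items, :vec0, [
  "id integer PRIMARY KEY AUTOINCREMENT NOT NULL",
  "embedding float[3] distance_metric=cosine"
]
```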
You can optionally ignore any shadow tables that are created
```ruby
ActiveRecord::SchemaDumper.ignore_tables += ["items_chunks", "items_rowids", "items_vector_chunks00"]
```
Get the k nearest neighbors

```ruby
Item.where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)
```
Filter by primary key
```ruby
Item.where(id: [2, 3]).where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)
```
Use the type option for int8 vectors

```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3, type: :int8
end
```
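Int8 vectors are then stored and queried like float vectors (a sketch; the sample values are assumptions):

```ruby
# integer elements instead of floats
item.update(embedding: [1, 2, 3])
Item.nearest_neighbors(:embedding, [1, 2, 3], distance: "euclidean").first(5)
```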
Use the type option for binary vectors

```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 8, type: :bit
end
```
Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
```
Examples:

- Embeddings with OpenAI
- Binary embeddings with Cohere
- Sentence embeddings with Informers
- Hybrid search with Informers
- Sparse search with Transformers.rb
- Recommendations with Disco
Generate a model
```sh
rails generate model Document content:text embedding:vector{1536}
rails db:migrate
```

And add has_neighbors

```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```
Create a method to call the embeddings API

```ruby
def embed(input)
  url = "https://api.openai.com/v1/embeddings"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("OPENAI_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    input: input,
    model: "text-embedding-3-small"
  }
  response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
  JSON.parse(response.body)["data"].map { |v| v["embedding"] }
end
```
Pass your input
```ruby
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = embed(input)
```
Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```
And get similar documents
```ruby
document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
```

See the complete code
Generate a model
```sh
rails generate model Document content:text embedding:bit{1536}
rails db:migrate
```

And add has_neighbors

```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```
Create a method to call the embed API

```ruby
def embed(input, input_type)
  url = "https://api.cohere.com/v2/embed"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("CO_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    texts: input,
    model: "embed-v4.0",
    input_type: input_type,
    embedding_types: ["ubinary"]
  }
  response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
  JSON.parse(response.body)["embeddings"]["ubinary"].map { |e| e.map { |v| v.chr.unpack1("B*") }.join }
end
```
Pass your input
```ruby
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = embed(input, "search_document")
```
Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```
Embed the search query
query="forest"query_embedding=embed([query],"search_query")[0]
And search the documents
```ruby
Document.nearest_neighbors(:embedding, query_embedding, distance: "hamming").first(5).map(&:content)
```

See the complete code

You can generate embeddings locally with Informers.
Generate a model
```sh
rails generate model Document content:text embedding:vector{384}
rails db:migrate
```

And add has_neighbors

```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```
Load a model

```ruby
model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")
```
Pass your input
```ruby
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = model.(input)
```
Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```
And get similar documents
```ruby
document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
```

See the complete code

You can use Neighbor for hybrid search with Informers.
Generate a model
```sh
rails generate model Document content:text embedding:vector{768}
rails db:migrate
```

And add has_neighbors and a scope for keyword search

```ruby
class Document < ApplicationRecord
  has_neighbors :embedding

  scope :search, ->(query) {
    where("to_tsvector(content) @@ plainto_tsquery(?)", query)
      .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
  }
end
```
Create some documents
```ruby
Document.create!(content: "The dog is barking")
Document.create!(content: "The cat is purring")
Document.create!(content: "The bear is growling")
```
Generate an embedding for each document
```ruby
embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model

Document.find_each do |document|
  embedding = embed.(document.content, **embed_options)
  document.update!(embedding: embedding)
end
```
Perform keyword search
query="growling bear"keyword_results=Document.search(query).limit(20).load_async
And semantic search in parallel (the query prefix is specific to theembedding model)
query_prefix="Represent this sentence for searching relevant passages: "query_embedding=embed.(query_prefix +query, **embed_options)semantic_results=Document.nearest_neighbors(:embedding,query_embedding,distance:"cosine").limit(20).load_async
To combine the results, use Reciprocal Rank Fusion (RRF)
```ruby
Neighbor::Reranking.rrf(keyword_results, semantic_results).first(5)
```
Or a reranking model
```ruby
rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
results = (keyword_results + semantic_results).uniq
rerank.(query, results.map(&:content)).first(5).map { |v| results[v[:doc_id]] }
```

See the complete code

You can generate sparse embeddings locally with Transformers.rb.
Generate a model
```sh
rails generate model Document content:text embedding:sparsevec{30522}
rails db:migrate
```

And add has_neighbors

```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```
Load a model to generate embeddings

```ruby
class EmbeddingModel
  def initialize(model_id)
    @model = Transformers::AutoModelForMaskedLM.from_pretrained(model_id)
    @tokenizer = Transformers::AutoTokenizer.from_pretrained(model_id)
    @special_token_ids = @tokenizer.special_tokens_map.map { |_, token| @tokenizer.vocab[token] }
  end

  def embed(input)
    feature = @tokenizer.(input, padding: true, truncation: true, return_tensors: "pt", return_token_type_ids: false)
    output = @model.(**feature)[0]
    values = Torch.max(output * feature[:attention_mask].unsqueeze(-1), dim: 1)[0]
    values = Torch.log(1 + Torch.relu(values))
    values[0.., @special_token_ids] = 0
    values.to_a
  end
end

model = EmbeddingModel.new("opensearch-project/opensearch-neural-sparse-encoding-v1")
```
Pass your input
```ruby
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = model.embed(input)
```
Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: Neighbor::SparseVector.new(embedding)}
end
Document.insert_all!(documents)
```
Embed the search query
query="forest"query_embedding=model.embed([query])[0]
And search the documents
```ruby
Document.nearest_neighbors(:embedding, Neighbor::SparseVector.new(query_embedding), distance: "inner_product").first(5).map(&:content)
```

See the complete code

You can use Neighbor for online item-based recommendations with Disco. We’ll use MovieLens data for this example.
Generate a model
```sh
rails generate model Movie name:string factors:cube
rails db:migrate
```

And add has_neighbors

```ruby
class Movie < ApplicationRecord
  has_neighbors :factors, dimensions: 20, normalize: true
end
```
Fit the recommender
```ruby
data = Disco.load_movielens
recommender = Disco::Recommender.new(factors: 20)
recommender.fit(data)
```
Store the item factors
```ruby
movies = []
recommender.item_ids.each do |item_id|
  movies << {name: item_id, factors: recommender.item_factors(item_id)}
end
Movie.create!(movies)
```
And get similar movies
```ruby
movie = Movie.find_by(name: "Star Wars (1977)")
movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
```

See the complete code for cube and pgvector

View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
```sh
git clone https://github.com/ankane/neighbor.git
cd neighbor
bundle install

# Postgres
createdb neighbor_test
bundle exec rake test:postgresql

# SQLite
bundle exec rake test:sqlite

# MariaDB
docker run -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 -e MARIADB_DATABASE=neighbor_test -p 3307:3306 mariadb:11.8
bundle exec rake test:mariadb

# MySQL
docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=neighbor_test -p 3306:3306 mysql:9
bundle exec rake test:mysql
```