Comprehensive Vector Data Tooling. The universal interface for all vector databases, datasets and RAG platforms. Easily export, import, backup, re-embed (using any model) or access your vector data from any vector database or repository.
This library uses a universal format for vector datasets to easily export and import data from all vector databases.
Request support for a VectorDB by voting/commenting on this poll.

See the Contributing section to add support for your favorite vector database.
## Fully Supported
Vector Database | Import | Export |
---|---|---|
Pinecone | ✅ | ✅ |
Qdrant | ✅ | ✅ |
Milvus | ✅ | ✅ |
GCP Vertex AI Vector Search | ✅ | ✅ |
KDB.AI | ✅ | ✅ |
LanceDB | ✅ | ✅ |
DataStax Astra DB | ✅ | ✅ |
Chroma | ✅ | ✅ |
Turbopuffer | ✅ | ✅ |
## Partial

Vector Database | Import | Export |
---|---|---|
## In Progress
Vector Database | Import | Export |
---|---|---|
Azure AI Search | ❌ | ❌ |
Weaviate | ❌ | ❌ |
MongoDB Atlas | ❌ | ❌ |
OpenSearch | ❌ | ❌ |
Apache Cassandra | ❌ | ❌ |
txtai | ❌ | ❌ |
pgvector | ❌ | ❌ |
SQLite-VSS | ❌ | ❌ |
## Not Supported
Vector Database | Import | Export |
---|---|---|
Vespa | ❌ | ❌ |
Marqo | ❌ | ❌ |
Elasticsearch | ❌ | ❌ |
Redis Search | ❌ | ❌ |
ClickHouse | ❌ | ❌ |
USearch | ❌ | ❌ |
Rockset | ❌ | ❌ |
Epsilla | ❌ | ❌ |
Activeloop Deep Lake | ❌ | ❌ |
ApertureDB | ❌ | ❌ |
CrateDB | ❌ | ❌ |
Meilisearch | ❌ | ❌ |
MyScale | ❌ | ❌ |
Neo4j | ❌ | ❌ |
Nuclia DB | ❌ | ❌ |
OramaSearch | ❌ | ❌ |
Typesense | ❌ | ❌ |
Anari AI | ❌ | ❌ |
Vald | ❌ | ❌ |
Apache Solr | ❌ | ❌ |
## Installation

Install from PyPI:

```bash
pip install vdf-io
```

Or install from source:

```bash
git clone https://github.com/AI-Northstar-Tech/vector-io.git
cd vector-io
pip install -r requirements.txt
```
## Universal Vector Dataset Format (VDF) specification

- VDF_META.json: a JSON file with the schema VDFMeta, defined in src/vdf_io/meta_types.py (an illustrative example follows this list):
```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class NamespaceMeta(BaseModel):
    namespace: str
    index_name: str
    total_vector_count: int
    exported_vector_count: int
    dimensions: int
    model_name: str | None = None
    vector_columns: List[str] = ["vector"]
    data_path: str
    metric: str | None = None
    index_config: Optional[Dict[Any, Any]] = None
    schema_dict: Optional[Dict[str, Any]] = None


class VDFMeta(BaseModel):
    version: str
    file_structure: List[str]
    author: str
    exported_from: str
    indexes: Dict[str, List[NamespaceMeta]]
    exported_at: str
    id_column: Optional[str] = None
```
- Parquet files/folders for metadata and vectors.
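For illustration, here is a minimal sketch of how the metadata for a single-index, single-namespace export could be assembled with the models above and written out as VDF_META.json. The import path, index name, counts and paths are assumed example values, not output from a real export.

```python
from vdf_io.meta_types import NamespaceMeta, VDFMeta  # assumed import path for the models above

# All values below are illustrative placeholders for a hypothetical export.
meta = VDFMeta(
    version="1.0",
    file_structure=["VDF_META.json", "my_index"],
    author="example-user",
    exported_from="pinecone",
    indexes={
        "my_index": [
            NamespaceMeta(
                namespace="",
                index_name="my_index",
                total_vector_count=10_000,
                exported_vector_count=10_000,
                dimensions=768,
                model_name="sentence-transformers/all-MiniLM-L6-v2",
                vector_columns=["vector"],
                data_path="my_index",
                metric="cosine",
            )
        ]
    },
    exported_at="2024-01-01T00:00:00+00:00",
    id_column="id",
)

# Serialize the way an exporter would (Pydantic v2 shown; use meta.json() on Pydantic v1).
with open("VDF_META.json", "w") as f:
    f.write(meta.model_dump_json(indent=4))
```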
## Export script

```
export_vdf --help
usage: export_vdf [-h] [-m MODEL_NAME] [--max_file_size MAX_FILE_SIZE]
                  [--push_to_hub | --no-push_to_hub] [--public | --no-public]
                  {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch} ...

Export data from various vector databases to the VDF format for vector datasets

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        Name of model used
  --max_file_size MAX_FILE_SIZE
                        Maximum file size in MB (default: 1024)
  --push_to_hub, --no-push_to_hub
                        Push to hub
  --public, --no-public
                        Make dataset public (default: False)

Vector Databases:
  Choose the vectors database to export data from

  {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch}
    pinecone            Export data from Pinecone
    qdrant              Export data from Qdrant
    kdbai               Export data from KDB.AI
    milvus              Export data from Milvus
    vertexai_vectorsearch
                        Export data from Vertex AI Vector Search
```
## Import script

```
import_vdf --help
usage: import_vdf [-h] [-d DIR] [-s | --subset | --no-subset]
                  [--create_new | --no-create_new]
                  {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai} ...

Import data from VDF to a vector database

options:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     Directory to import
  -s, --subset, --no-subset
                        Import a subset of data (default: False)
  --create_new, --no-create_new
                        Create a new index (default: False)

Vector Databases:
  Choose the vectors database to import data into

  {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai}
    milvus              Import data to Milvus
    pinecone            Import data to Pinecone
    qdrant              Import data to Qdrant
    vertexai_vectorsearch
                        Import data to Vertex AI Vector Search
    kdbai               Import data to KDB.AI
```
## Re-embed script

This script re-embeds a vector dataset. It takes a directory containing a vector dataset in the VDF format and re-embeds it using a new model. It also lets you specify the name of the column containing the text to be embedded.
```
reembed_vdf --help
usage: reembed_vdf [-h] -d DIR [-m NEW_MODEL_NAME] [-t TEXT_COLUMN]

Reembed a vector dataset

options:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     Directory of vector dataset in the VDF format
  -m NEW_MODEL_NAME, --new_model_name NEW_MODEL_NAME
                        Name of new model to be used
  -t TEXT_COLUMN, --text_column TEXT_COLUMN
                        Name of the column containing text to be embedded
```
## Examples

```bash
export_vdf -m hkunlp/instructor-xl --push_to_hub pinecone --environment gcp-starter
import_vdf -d /path/to/vdf/dataset milvus
reembed_vdf -d /path/to/vdf/dataset -m sentence-transformers/all-MiniLM-L6-v2 -t title
```
Follow the prompts to select the index and ID range to export.
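Since a VDF dataset is just VDF_META.json plus parquet files/folders, you can also inspect exported data directly. A short sketch, assuming pandas (with pyarrow) is installed and using a hypothetical dataset directory:

```python
import json
from pathlib import Path

import pandas as pd  # assumes pandas with the pyarrow engine is installed

dataset_dir = Path("/path/to/vdf/dataset")  # hypothetical directory produced by export_vdf

# VDF_META.json lists every index/namespace and where its parquet data lives.
meta = json.loads((dataset_dir / "VDF_META.json").read_text())
for index_name, namespaces in meta["indexes"].items():
    for ns in namespaces:
        print(index_name, ns["namespace"], ns["exported_vector_count"], ns["dimensions"])
        # Each namespace's data_path points to parquet file(s)/folder(s) holding vectors + metadata.
        df = pd.read_parquet(dataset_dir / ns["data_path"])
        print(df.columns.tolist(), len(df))
```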
## Contributing

If you wish to add an import/export implementation for a new vector database, you must also implement the other side of the import/export for the same database. Please fork the repo and send a PR for both the import and export scripts.
Steps to add a new vector database (ABC):
- Add your database name in src/vdf_io/names.py in the DBNames enum class.
- Create new files src/vdf_io/export_vdf/export_abc.py and src/vdf_io/import_vdf/import_abc.py for the new DB.
### Export
- In your export file, define a class ExportABC which inherits from ExportVDF.
- Specify a DB_NAME_SLUG for the class
- The class should implement (a minimal sketch follows these steps):
  - make_parser(): add database-specific arguments to the export_vdf CLI.
  - export_vdb(): prompt the user for any information not provided on the CLI, then call get_data().
  - get_data(): download points (in a batched manner) with all their metadata from the specified index of the vector database. This data should be stored in a series of parquet files/folders, and the metadata should be stored in a JSON file with the schema above.
- Use the script to export data from an example index of the vector database and verify that the data is exported correctly.
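A minimal sketch of what the export class might look like. The hook names (DB_NAME_SLUG, make_parser, export_vdb, get_data) come from the steps above, but the import path, method signatures and CLI arguments shown here are assumptions; refer to the existing exporters in src/vdf_io/export_vdf/ for the real interface.

```python
# Sketch only: import path and signatures below are assumptions, not the actual vdf_io API.
from vdf_io.export_vdf.vdb_export_cls import ExportVDF  # assumed location of the ExportVDF base class


class ExportABC(ExportVDF):
    DB_NAME_SLUG = "abc"  # slug registered for ABC in the DBNames enum (src/vdf_io/names.py)

    @classmethod
    def make_parser(cls, subparsers):
        # Add ABC-specific arguments to the export_vdf CLI.
        parser = subparsers.add_parser(cls.DB_NAME_SLUG, help="Export data from ABC")
        parser.add_argument("--url", type=str, help="URL of the ABC instance")
        parser.add_argument("--api_key", type=str, help="API key for ABC")

    @classmethod
    def export_vdb(cls, args):
        # Prompt for anything the user did not pass on the CLI, then run the export.
        if not args.get("url"):
            args["url"] = input("Enter the URL of the ABC instance: ")
        instance = cls(args)
        instance.get_data()
        return instance

    def get_data(self):
        # Fetch points from ABC in batches, write vectors and metadata to parquet
        # files/folders, and record a NamespaceMeta entry per index/namespace so
        # that VDF_META.json can be written with the schema shown earlier.
        raise NotImplementedError
```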
### Import
- In your import file, define a class ImportABC which inherits from ImportVDF.
- Specify a DB_NAME_SLUG for the class
- The class should implement (a sketch mirroring the export side follows these steps):
  - make_parser(): add database-specific arguments to the import_vdf CLI, such as the URL of the database, any authentication tokens, etc.
  - import_vdb(): prompt the user for any information not provided on the CLI, then call upsert_data().
  - upsert_data(): upload points from a VDF dataset (in a batched manner) with all their metadata to the specified index of the vector database. All metadata about the dataset should be read from the VDF_META.json file in the VDF folder.
- Use the script to import data from the example vdf dataset exported in the previous step and verify that the data is imported correctly.
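And a matching sketch for the import side, with the same caveat: only the hook names are taken from the steps above, while the import path, signatures and arguments are assumptions to be checked against the existing importers in src/vdf_io/import_vdf/.

```python
# Sketch only: import path and signatures below are assumptions, not the actual vdf_io API.
from vdf_io.import_vdf.vdf_import_cls import ImportVDF  # assumed location of the ImportVDF base class


class ImportABC(ImportVDF):
    DB_NAME_SLUG = "abc"  # same slug as the export class

    @classmethod
    def make_parser(cls, subparsers):
        # Add ABC-specific arguments to the import_vdf CLI (connection URL, auth token, etc.).
        parser = subparsers.add_parser(cls.DB_NAME_SLUG, help="Import data to ABC")
        parser.add_argument("--url", type=str, help="URL of the ABC instance")
        parser.add_argument("--token", type=str, help="Authentication token for ABC")

    @classmethod
    def import_vdb(cls, args):
        # Prompt for anything missing from the CLI, then upsert the dataset.
        if not args.get("url"):
            args["url"] = input("Enter the URL of the ABC instance: ")
        instance = cls(args)
        instance.upsert_data()
        return instance

    def upsert_data(self):
        # Read VDF_META.json from the dataset directory, then stream the parquet
        # files and upsert points (vectors + metadata) to the target index in batches.
        raise NotImplementedError
```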
If you wish to change the VDF specification, please open an issue to discuss the change before sending a PR.
If you wish to improve the efficiency of the import/export scripts, please fork the repo and send a PR.
## Telemetry

Running the scripts in the repo will send anonymous usage data to AI Northstar Tech to help improve the library. You can opt out of this by setting the environment variable DISABLE_TELEMETRY_VECTORIO to 1.
If you have any questions, please open an issue on the repo or message Dhruv Anand on LinkedIn.
## Contributors

- Dhruv Anand 💻🐛📖
- Jayesh Rathi 💻
- Jordan Totten 💻