Comprehensive Vector Data Tooling. The universal interface for all vector databases, datasets and RAG platforms. Easily export, import, backup, re-embed (using any model) or access your vector data from any vector database or repository.
This library uses a universal format for vector datasets to easily export and import data from all vector databases.
Request support for a VectorDB by voting/commenting on this poll.

See the Contributing section to add support for your favorite vector database.
## Fully Supported
Vector Database | Import | Export |
---|---|---|
Pinecone | ✅ | ✅ |
Qdrant | ✅ | ✅ |
Milvus | ✅ | ✅ |
GCP Vertex AI Vector Search | ✅ | ✅ |
KDB.AI | ✅ | ✅ |
LanceDB | ✅ | ✅ |
DataStax Astra DB | ✅ | ✅ |
Chroma | ✅ | ✅ |
Turbopuffer | ✅ | ✅ |
## Partial

Vector Database | Import | Export |
---|---|---|
## In Progress
Vector Database | Import | Export |
---|---|---|
Azure AI Search | ❌ | ❌ |
Weaviate | ❌ | ❌ |
MongoDB Atlas | ❌ | ❌ |
OpenSearch | ❌ | ❌ |
Apache Cassandra | ❌ | ❌ |
txtai | ❌ | ❌ |
pgvector | ❌ | ❌ |
SQLite-VSS | ❌ | ❌ |
## Not Supported
Vector Database | Import | Export |
---|---|---|
Vespa | ❌ | ❌ |
Marqo | ❌ | ❌ |
Elasticsearch | ❌ | ❌ |
Redis Search | ❌ | ❌ |
ClickHouse | ❌ | ❌ |
USearch | ❌ | ❌ |
Rockset | ❌ | ❌ |
Epsilla | ❌ | ❌ |
Activeloop Deep Lake | ❌ | ❌ |
ApertureDB | ❌ | ❌ |
CrateDB | ❌ | ❌ |
Meilisearch | ❌ | ❌ |
MyScale | ❌ | ❌ |
Neo4j | ❌ | ❌ |
Nuclia DB | ❌ | ❌ |
OramaSearch | ❌ | ❌ |
Typesense | ❌ | ❌ |
Anari AI | ❌ | ❌ |
Vald | ❌ | ❌ |
Apache Solr | ❌ | ❌ |
## Installation

Install from PyPI:

```bash
pip install vdf-io
```

Or install from source:

```bash
git clone https://github.com/AI-Northstar-Tech/vector-io.git
cd vector-io
pip install -r requirements.txt
```
## Universal Vector Dataset Format (VDF) specification

- VDF_META.json: a JSON file with the schema VDFMeta, defined in src/vdf_io/meta_types.py (an illustrative example follows this list):
```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class NamespaceMeta(BaseModel):
    namespace: str
    index_name: str
    total_vector_count: int
    exported_vector_count: int
    dimensions: int
    model_name: str | None = None
    vector_columns: List[str] = ["vector"]
    data_path: str
    metric: str | None = None
    index_config: Optional[Dict[Any, Any]] = None
    schema_dict: Optional[Dict[str, Any]] = None


class VDFMeta(BaseModel):
    version: str
    file_structure: List[str]
    author: str
    exported_from: str
    indexes: Dict[str, List[NamespaceMeta]]
    exported_at: str
    id_column: Optional[str] = None
```
- Parquet files/folders for metadata and vectors.
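For illustration, here is a minimal sketch of how the metadata for a single-index, single-namespace export could be assembled with the models above and written out as VDF_META.json. The import path, index name, counts and paths are assumed example values, not output from a real export.

```python
from vdf_io.meta_types import NamespaceMeta, VDFMeta  # assumed import path for the models above

# All values below are illustrative placeholders for a hypothetical export.
meta = VDFMeta(
    version="1.0",
    file_structure=["VDF_META.json", "my_index"],
    author="example-user",
    exported_from="pinecone",
    indexes={
        "my_index": [
            NamespaceMeta(
                namespace="",
                index_name="my_index",
                total_vector_count=10_000,
                exported_vector_count=10_000,
                dimensions=768,
                model_name="sentence-transformers/all-MiniLM-L6-v2",
                vector_columns=["vector"],
                data_path="my_index",
                metric="cosine",
            )
        ]
    },
    exported_at="2024-01-01T00:00:00+00:00",
    id_column="id",
)

# Serialize the way an exporter would (Pydantic v2 shown; use meta.json() on Pydantic v1).
with open("VDF_META.json", "w") as f:
    f.write(meta.model_dump_json(indent=4))
```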
## Export script

```
export_vdf --help
usage: export_vdf [-h] [-m MODEL_NAME] [--max_file_size MAX_FILE_SIZE]
                  [--push_to_hub | --no-push_to_hub] [--public | --no-public]
                  {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch} ...

Export data from various vector databases to the VDF format for vector datasets

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        Name of model used
  --max_file_size MAX_FILE_SIZE
                        Maximum file size in MB (default: 1024)
  --push_to_hub, --no-push_to_hub
                        Push to hub
  --public, --no-public
                        Make dataset public (default: False)

Vector Databases:
  Choose the vectors database to export data from

  {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch}
    pinecone            Export data from Pinecone
    qdrant              Export data from Qdrant
    kdbai               Export data from KDB.AI
    milvus              Export data from Milvus
    vertexai_vectorsearch
                        Export data from Vertex AI Vector Search
```
## Import script

```
import_vdf --help
usage: import_vdf [-h] [-d DIR] [-s | --subset | --no-subset]
                  [--create_new | --no-create_new]
                  {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai} ...

Import data from VDF to a vector database

options:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     Directory to import
  -s, --subset, --no-subset
                        Import a subset of data (default: False)
  --create_new, --no-create_new
                        Create a new index (default: False)

Vector Databases:
  Choose the vectors database to import data into

  {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai}
    milvus              Import data to Milvus
    pinecone            Import data to Pinecone
    qdrant              Import data to Qdrant
    vertexai_vectorsearch
                        Import data to Vertex AI Vector Search
    kdbai               Import data to KDB.AI
```
## Re-embed script

This script re-embeds a vector dataset. It takes a directory containing a vector dataset in the VDF format and re-embeds it using a new model. It also lets you specify the name of the column containing the text to be embedded.
```
reembed_vdf --help
usage: reembed_vdf [-h] -d DIR [-m NEW_MODEL_NAME] [-t TEXT_COLUMN]

Reembed a vector dataset

options:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     Directory of vector dataset in the VDF format
  -m NEW_MODEL_NAME, --new_model_name NEW_MODEL_NAME
                        Name of new model to be used
  -t TEXT_COLUMN, --text_column TEXT_COLUMN
                        Name of the column containing text to be embedded
```
## Examples

```bash
export_vdf -m hkunlp/instructor-xl --push_to_hub pinecone --environment gcp-starter
import_vdf -d /path/to/vdf/dataset milvus
reembed_vdf -d /path/to/vdf/dataset -m sentence-transformers/all-MiniLM-L6-v2 -t title
```
Follow the prompts to select the index and ID range to export.
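Since a VDF dataset is just VDF_META.json plus parquet files/folders, you can also inspect exported data directly. A short sketch, assuming pandas (with pyarrow) is installed and using a hypothetical dataset directory:

```python
import json
from pathlib import Path

import pandas as pd  # assumes pandas with the pyarrow engine is installed

dataset_dir = Path("/path/to/vdf/dataset")  # hypothetical directory produced by export_vdf

# VDF_META.json lists every index/namespace and where its parquet data lives.
meta = json.loads((dataset_dir / "VDF_META.json").read_text())
for index_name, namespaces in meta["indexes"].items():
    for ns in namespaces:
        print(index_name, ns["namespace"], ns["exported_vector_count"], ns["dimensions"])
        # Each namespace's data_path points to parquet file(s)/folder(s) holding vectors + metadata.
        df = pd.read_parquet(dataset_dir / ns["data_path"])
        print(df.columns.tolist(), len(df))
```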
## Contributing

If you wish to add an import/export implementation for a new vector database, you must also implement the other side of the import/export for the same database. Please fork the repo and send a PR for both the import and export scripts.
Steps to add a new vector database (ABC):
- Add your database name in src/vdf_io/names.py in the DBNames enum class.
- Create new files src/vdf_io/export_vdf/export_abc.py and src/vdf_io/import_vdf/import_abc.py for the new DB.
### Export
- In your export file, define a class ExportABC which inherits from ExportVDF.
- Specify a DB_NAME_SLUG for the class
- The class should implement (a minimal sketch follows these steps):
  - make_parser(): add database-specific arguments to the export_vdf CLI.
  - export_vdb(): prompt the user for any information not provided on the CLI, then call get_data().
  - get_data(): download points (in a batched manner) with all their metadata from the specified index of the vector database. This data should be stored in a series of parquet files/folders, and the metadata should be stored in a JSON file with the schema above.
- Use the script to export data from an example index of the vector database and verify that the data is exported correctly.
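A minimal sketch of what the export class might look like. The hook names (DB_NAME_SLUG, make_parser, export_vdb, get_data) come from the steps above, but the import path, method signatures and CLI arguments shown here are assumptions; refer to the existing exporters in src/vdf_io/export_vdf/ for the real interface.

```python
# Sketch only: import path and signatures below are assumptions, not the actual vdf_io API.
from vdf_io.export_vdf.vdb_export_cls import ExportVDF  # assumed location of the ExportVDF base class


class ExportABC(ExportVDF):
    DB_NAME_SLUG = "abc"  # slug registered for ABC in the DBNames enum (src/vdf_io/names.py)

    @classmethod
    def make_parser(cls, subparsers):
        # Add ABC-specific arguments to the export_vdf CLI.
        parser = subparsers.add_parser(cls.DB_NAME_SLUG, help="Export data from ABC")
        parser.add_argument("--url", type=str, help="URL of the ABC instance")
        parser.add_argument("--api_key", type=str, help="API key for ABC")

    @classmethod
    def export_vdb(cls, args):
        # Prompt for anything the user did not pass on the CLI, then run the export.
        if not args.get("url"):
            args["url"] = input("Enter the URL of the ABC instance: ")
        instance = cls(args)
        instance.get_data()
        return instance

    def get_data(self):
        # Fetch points from ABC in batches, write vectors and metadata to parquet
        # files/folders, and record a NamespaceMeta entry per index/namespace so
        # that VDF_META.json can be written with the schema shown earlier.
        raise NotImplementedError
```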
### Import
- In your import file, define a class ImportABC which inherits from ImportVDF.
- Specify a DB_NAME_SLUG for the class
- The class should implement (a sketch mirroring the export side follows these steps):
  - make_parser(): add database-specific arguments to the import_vdf CLI, such as the URL of the database, any authentication tokens, etc.
  - import_vdb(): prompt the user for any information not provided on the CLI, then call upsert_data().
  - upsert_data(): upload points from a VDF dataset (in a batched manner) with all their metadata to the specified index of the vector database. All metadata about the dataset should be read from the VDF_META.json file in the VDF folder.
- Use the script to import data from the example vdf dataset exported in the previous step and verify that the data is imported correctly.
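And a matching sketch for the import side, with the same caveat: only the hook names are taken from the steps above, while the import path, signatures and arguments are assumptions to be checked against the existing importers in src/vdf_io/import_vdf/.

```python
# Sketch only: import path and signatures below are assumptions, not the actual vdf_io API.
from vdf_io.import_vdf.vdf_import_cls import ImportVDF  # assumed location of the ImportVDF base class


class ImportABC(ImportVDF):
    DB_NAME_SLUG = "abc"  # same slug as the export class

    @classmethod
    def make_parser(cls, subparsers):
        # Add ABC-specific arguments to the import_vdf CLI (connection URL, auth token, etc.).
        parser = subparsers.add_parser(cls.DB_NAME_SLUG, help="Import data to ABC")
        parser.add_argument("--url", type=str, help="URL of the ABC instance")
        parser.add_argument("--token", type=str, help="Authentication token for ABC")

    @classmethod
    def import_vdb(cls, args):
        # Prompt for anything missing from the CLI, then upsert the dataset.
        if not args.get("url"):
            args["url"] = input("Enter the URL of the ABC instance: ")
        instance = cls(args)
        instance.upsert_data()
        return instance

    def upsert_data(self):
        # Read VDF_META.json from the dataset directory, then stream the parquet
        # files and upsert points (vectors + metadata) to the target index in batches.
        raise NotImplementedError
```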
If you wish to change the VDF specification, please open an issue to discuss the change before sending a PR.
If you wish to improve the efficiency of the import/export scripts, please fork the repo and send a PR.
## Telemetry

Running the scripts in the repo will send anonymous usage data to AI Northstar Tech to help improve the library. You can opt out of this by setting the environment variable DISABLE_TELEMETRY_VECTORIO to 1.
If you have any questions, please open an issue on the repo or message Dhruv Anand on LinkedIn.
## Contributors

- Dhruv Anand 💻🐛📖
- Jayesh Rathi 💻
- Jordan Totten 💻