pgml Python SDK with vector search support #636

Merged
santiatpml merged 21 commits into master from santi-pgml-memory-sdk-python
May 23, 2023
21 commits
89c1223 Python SDK init (santiadavani, May 12, 2023)
53dde0f create collection init (santiatpml, May 13, 2023)
2d685e8 Upsert documents + tests (santiatpml, May 16, 2023)
86ccef8 Creating more tables as part of collection .. (santiatpml, May 16, 2023)
5b5cee4 Register models and text splitters (santiatpml, May 16, 2023)
7b03e01 Refactored run select and added models (santiatpml, May 17, 2023)
2d9202e Embeddings and vector search (santiatpml, May 18, 2023)
ea19ecc Incremental updates for chunks and embeddings (santiatpml, May 19, 2023)
dee6e5b Docstrings for all modules (santiatpml, May 19, 2023)
b7a0495 Minor updates (santiatpml, May 19, 2023)
5186705 Added basic readme with quickstart (santiatpml, May 19, 2023)
5c8cf62 Updated readme with PGML_CONNECTION (santiatpml, May 19, 2023)
5a81918 Updated readme (santiatpml, May 19, 2023)
a5d1618 Minor API and notebook updates (santiatpml, May 19, 2023)
368da8a Using document_id, chunk_id etc. for column names (santiatpml, May 22, 2023)
986b314 Renaming model -> model_id and splitter -> splitter_id (santiatpml, May 22, 2023)
4601edc Performance improvements (santiadavani, May 23, 2023)
988ea41 delete collection is replaced with archive collection (santiadavani, May 23, 2023)
97ec30b Support for uuids without dashes (santiadavani, May 23, 2023)
4793563 Refactored upsert documents (santiadavani, May 23, 2023)
998c996 Merge branch 'master' into santi-pgml-memory-sdk-python (santiatpml, May 23, 2023)
38 changes: 38 additions & 0 deletions pgml-sdks/python/pgml/README.md
@@ -0,0 +1,38 @@
# PostgresML Python SDK
This Python SDK provides a simple interface to PostgresML's generative AI capabilities.

## Table of Contents

- [Quickstart](#quickstart)

### Quickstart
1. Install Python 3.11. The SDK should work with Python >=3.8, but so far it has only been tested with Python 3.11.
2. Clone the repository and check out the SDK branch (before the PR is merged)
```
git clone https://github.com/postgresml/postgresml
cd postgresml
git checkout santi-pgml-memory-sdk-python
cd pgml-sdks/python/pgml
```
3. Install Poetry: `pip install poetry`
4. Initialize the Python environment

```
poetry env use python3.11
poetry shell
poetry install
poetry build
```
5. The SDK uses your local PostgresML database by default:
`postgres://postgres@127.0.0.1:5433/pgml_development`

If your local database is not up to date with `pgml.embed`, please [sign up for a free database](https://postgresml.org/signup) and set the `PGML_CONNECTION` environment variable to the connection string of your hosted serverless database.

```
export PGML_CONNECTION="postgres://<username>:<password>@<hostname>:<port>/pgm<database>"
```
6. Run a **vector search** example
```
python examples/vector_search.py
```
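
Step 5 above can be sketched in plain Python. This is only an illustration of the fallback the example script relies on (`os.environ.get("PGML_CONNECTION", local_pgml)`); the helper names `resolve_conninfo` and `describe_conninfo` are hypothetical and are not part of the SDK.

```python
import os
from urllib.parse import urlsplit

# Default connection string quoted in the README; the helpers below are illustrative only.
LOCAL_PGML = "postgres://postgres@127.0.0.1:5433/pgml_development"


def resolve_conninfo(environ=os.environ):
    """Return PGML_CONNECTION if it is set, otherwise the local default."""
    return environ.get("PGML_CONNECTION", LOCAL_PGML)


def describe_conninfo(conninfo):
    """Split a postgres:// URL into its parts (hypothetical helper, not an SDK call)."""
    parts = urlsplit(conninfo)
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Handy for checking which database the examples would connect to.
    print(describe_conninfo(resolve_conninfo()))
```

Printing the parsed parts before running the examples is a cheap way to confirm whether the local or the hosted database will be used.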

236 changes: 236 additions & 0 deletions pgml-sdks/python/pgml/examples/vector_search.ipynb
@@ -0,0 +1,236 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pgml import Database\n",
"import os\n",
"import json"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_pgml = \"postgres://postgres@127.0.0.1:5433/pgml_development\"\n",
"\n",
"conninfo = os.environ.get(\"PGML_CONNECTION\",local_pgml)\n",
"db = Database(conninfo,min_connections=4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection_name = \"test_pgml_sdk_1\"\n",
"collection = db.create_or_get_collection(collection_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"\n",
"data = load_dataset(\"squad\", split=\"train\")\n",
"data = data.to_pandas()\n",
"data.head()\n",
"\n",
"data = data.drop_duplicates(subset=[\"context\"])\n",
"print(len(data))\n",
"data.head()\n",
"\n",
"documents = [\n",
" {\n",
" 'text': r['context'],\n",
" 'metadata': {\n",
" 'title': r['title']\n",
" }\n",
" } for r in data.to_dict(orient='records')\n",
"]\n",
"documents[:3]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection.upsert_documents(documents[0:200])\n",
"collection.generate_chunks()\n",
"collection.generate_embeddings()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = collection.vector_search(\"Who won 20 Grammy awards?\", top_k=2)\n",
"print(json.dumps(results,indent=2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection.register_model(model_name=\"paraphrase-MiniLM-L6-v2\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection.get_models()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(json.dumps(collection.get_models(),indent=2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection.generate_embeddings(model_id=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = collection.vector_search(\"Who won 20 Grammy awards?\", top_k=2, model_id=2)\n",
"print(json.dumps(results,indent=2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection.register_model(model_name=\"hkunlp/instructor-xl\", model_params={\"instruction\": \"Represent the Wikipedia document for retrieval: \"})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection.get_models()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection.generate_embeddings(model_id=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = collection.vector_search(\"Who won 20 Grammy awards?\", top_k=2, model_id=3, query_parameters={\"instruction\": \"Represent the Wikipedia question for retrieving supporting documents: \"})\n",
"print(json.dumps(results,indent=2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection.register_text_splitter(splitter_name=\"RecursiveCharacterTextSplitter\",splitter_params={\"chunk_size\": 100,\"chunk_overlap\": 20})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection.generate_chunks(splitter_id=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collection.generate_embeddings(splitter_id=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = collection.vector_search(\"Who won 20 Grammy awards?\", top_k=2, splitter_id=2)\n",
"print(json.dumps(results,indent=2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db.archive_collection(collection_name)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "pgml-zoggicR5-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
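
The notebook registers a `RecursiveCharacterTextSplitter` with `chunk_size` 100 and `chunk_overlap` 20. As a rough illustration of what those two parameters mean, here is a minimal sketch of a fixed-window character splitter. It is a simplification and an assumption on my part: it is not the recursive, separator-aware algorithm the named splitter uses, and `split_with_overlap` is a hypothetical name, not an SDK function.

```python
def split_with_overlap(text, chunk_size=100, chunk_overlap=20):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters.

    Simplified illustration only: a real recursive splitter prefers to break
    on separators such as paragraphs or sentences before falling back to
    raw character offsets.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Smaller chunks with some overlap generally give more, finer-grained embeddings per document, which is why the notebook regenerates chunks and embeddings after registering the second splitter.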
34 changes: 34 additions & 0 deletions pgml-sdks/python/pgml/examples/vector_search.py
@@ -0,0 +1,34 @@
from pgml import Database
import os
import json
from datasets import load_dataset
from time import time
from rich import print as rprint

local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"

conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
db = Database(conninfo)

collection_name = "test_pgml_sdk_1"
collection = db.create_or_get_collection(collection_name)


data = load_dataset("squad", split="train")
data = data.to_pandas()
data = data.drop_duplicates(subset=["context"])

documents = [
    {"id": r["id"], "text": r["context"], "title": r["title"]}
    for r in data.to_dict(orient="records")
]

collection.upsert_documents(documents[:200])
collection.generate_chunks()
collection.generate_embeddings()

start = time()
results = collection.vector_search("Who won 20 grammy awards?", top_k=2)
rprint(json.dumps(results, indent=2))
rprint("Query time %0.3f"%(time()-start))
db.archive_collection(collection_name)
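
The document-preparation step in this script (drop rows with a duplicate `context`, then map each row to an `id`/`text`/`title` dict) can be sketched without pandas or the `datasets` package. This is a minimal sketch under that assumption; `build_documents` is a hypothetical helper and the sample records are made up for illustration.

```python
def build_documents(records, limit=None):
    """Keep the first record for each distinct context and map it to a document dict."""
    seen = set()
    documents = []
    for r in records:
        if r["context"] in seen:
            continue  # same passage already added, mirroring drop_duplicates(subset=["context"])
        seen.add(r["context"])
        documents.append({"id": r["id"], "text": r["context"], "title": r["title"]})
    return documents if limit is None else documents[:limit]


if __name__ == "__main__":
    # Made-up records shaped like SQuAD rows (several questions can share one context).
    rows = [
        {"id": "a1", "context": "Passage one.", "title": "T1"},
        {"id": "a2", "context": "Passage one.", "title": "T1"},
        {"id": "b1", "context": "Passage two.", "title": "T2"},
    ]
    print(build_documents(rows))
```

Deduplicating before `upsert_documents` matters here because the dataset repeats the same passage once per question, so skipping it would store and embed identical text several times.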
7 changes: 7 additions & 0 deletions pgml-sdks/python/pgml/pgml/__init__.py
@@ -0,0 +1,7 @@
from .database import Database
from .collection import Collection
from .dbutils import (
    run_create_or_insert_statement,
    run_select_statement,
    run_drop_or_delete_statement,
)