
Commit e6a1619

pgml sdk examples (#669)

1 parent: 9949cde

12 files changed, +324 −253 lines changed

pgml-sdks/python/pgml/README.md

Lines changed: 15 additions & 2 deletions
```diff
@@ -1,10 +1,14 @@
-# Table of Contents
+# Open Source Alternative for Building End-to-End Vector Search Applications without OpenAI & Pinecone
+
+## Table of Contents
 
 - [Overview](#overview)
 - [Quickstart](#quickstart)
 - [Usage](#usage)
+- [Examples](./examples/README.md)
 - [Developer setup](#developer-setup)
 - [API Reference](#api-reference)
+- [Roadmap](#roadmap)
 
 ## Overview
 Python SDK is designed to facilitate the development of scalable vector search applications on PostgreSQL databases. With this SDK, you can seamlessly manage various database tables related to documents, text chunks, text splitters, LLM (Language Model) models, and embeddings. By leveraging the SDK's capabilities, you can efficiently index LLM embeddings using PgVector for fast and accurate queries.
@@ -274,4 +278,13 @@ LOGLEVEL=INFO python -m unittest tests/test_collection.py
 ### API Reference
 
 - [Database](./docs/pgml/database.md)
-- [Collection](./docs/pgml/collection.md)
+- [Collection](./docs/pgml/collection.md)
+
+### Roadmap
+
+- Enable filters on document metadata in `vector_search`. [Issue](https://github.com/postgresml/postgresml/issues/663)
+- `text_search` functionality on documents using Postgres text search. [Issue](https://github.com/postgresml/postgresml/issues/664)
+- `hybrid_search` functionality that does a combination of `vector_search` and `text_search` in an order specified by the user. [Issue](https://github.com/postgresml/postgresml/issues/665)
+- Ability to call and manage OpenAI embeddings for comparison purposes. [Issue](https://github.com/postgresml/postgresml/issues/666)
+- Save `vector_search` history for downstream monitoring of model performance. [Issue](https://github.com/postgresml/postgresml/issues/667)
+- Perform chunking on the DB with multiple langchain splitters. [Issue](https://github.com/postgresml/postgresml/issues/668)
```
Lines changed: 19 additions & 0 deletions

```markdown
## Examples

### [Semantic Search](./semantic_search.py)
This is a basic example of performing semantic search on a collection of documents. It loads the Quora dataset, creates a collection in a PostgreSQL database, upserts documents, generates chunks and embeddings, and then performs a vector search on a query. Embeddings are created using the `intfloat/e5-small` model. The results are documents semantically similar to the query. Finally, the collection is archived.

### [Question Answering](./question_answering.py)
This example finds documents relevant to a question in the collection. It loads the Stanford Question Answering Dataset (SQuAD) into the database and generates chunks and embeddings. The query is passed to vector search to retrieve the documents that match most closely in the embedding space. A score is returned with each search result.

### [Question Answering using Instructor Model](./question_answering_instructor.py)
In this example, we use the `hkunlp/instructor-base` model to build text embeddings instead of the default `intfloat/e5-small` model. We show how to use the `register_model` method and use the returned `model_id` to build and query embeddings.

### [Extractive Question Answering](./extractive_question_answering.py)
In this example, we show how to use a `vector_search` result as the `context` for a Hugging Face question answering model. We use `pgml.transform` to run the model on the database.

### [Table Question Answering](./table_question_answering.py)
In this example, we use the [Open Table-and-Text Question Answering (OTT-QA)](https://github.com/wenhuchen/OTT-QA) dataset to run queries on tables. We use the `deepset/all-mpnet-base-v2-table` model, which is trained to embed tabular data for retrieval tasks.
```
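All of the examples read their connection string the same way. A hedged sketch of running one, assuming you are inside `pgml-sdks/python/pgml` with the SDK and the example dependencies (`datasets`, `rich`, `python-dotenv`) installed:

```shell
# Assumption: a PostgresML database is reachable at this URL; every example
# falls back to the local development database when PGML_CONNECTION is unset.
export PGML_CONNECTION="postgres://postgres@127.0.0.1:5433/pgml_development"
echo "$PGML_CONNECTION"
# python examples/semantic_search.py   # any example runs the same way
```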
Lines changed: 69 additions & 0 deletions

```python
from pgml import Database
import os
import json
from datasets import load_dataset
from time import time
from dotenv import load_dotenv
from rich.console import Console
from psycopg import sql
from pgml.dbutils import run_select_statement

load_dotenv()
console = Console()

local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"

conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
db = Database(conninfo)

collection_name = "squad_collection"
collection = db.create_or_get_collection(collection_name)


data = load_dataset("squad", split="train")
data = data.to_pandas()
data = data.drop_duplicates(subset=["context"])

documents = [
    {"id": r["id"], "text": r["context"], "title": r["title"]}
    for r in data.to_dict(orient="records")
]

collection.upsert_documents(documents[:200])
collection.generate_chunks()
collection.generate_embeddings()

start = time()
query = "Who won more than 20 grammy awards?"
results = collection.vector_search(query, top_k=5)
_end = time()
console.print("\nResults for '%s'" % (query), style="bold")
console.print(results)
console.print("Query time = %0.3f" % (_end - start))

# Get the context passage and use pgml.transform to get short answer to the question

conn = db.pool.getconn()
context = " ".join(results[0]["chunk"].strip().split())
context = context.replace('"', '\\"').replace("'", "''")

select_statement = """SELECT pgml.transform(
    'question-answering',
    inputs => ARRAY[
        '{
            \"question\": \"%s\",
            \"context\": \"%s\"
        }'
    ]
) AS answer;""" % (
    query,
    context,
)

results = run_select_statement(conn, select_statement)
db.pool.putconn(conn)

console.print("\nResults for query '%s'" % query)
console.print(results)
db.archive_collection(collection_name)
```
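The manual quote-escaping above is easy to get wrong. A hedged alternative sketch (the `build_qa_payload` helper is made up, not part of the SDK): build the JSON payload with `json.dumps` and let the driver bind it as a parameter.

```python
import json


def build_qa_payload(question: str, context: str) -> str:
    # json.dumps produces correctly escaped JSON, replacing the
    # manual replace('"', '\\"') / replace("'", "''") calls above
    return json.dumps({"question": question, "context": context})


payload = build_qa_payload(
    "Who won more than 20 grammy awards?",
    'A "tricky" context with \'quotes\' in it.',
)
print(payload)

# Hypothetical usage with the psycopg connection from the example (not run here):
# cur = conn.cursor()
# cur.execute(
#     "SELECT pgml.transform('question-answering', inputs => ARRAY[%s]) AS answer;",
#     (payload,),
# )
```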

pgml-sdks/python/pgml/examples/vector_search.py renamed to pgml-sdks/python/pgml/examples/question_answering.py

Lines changed: 14 additions & 6 deletions
Original file line numberDiff line numberDiff line change
```diff
@@ -3,14 +3,18 @@
 import json
 from datasets import load_dataset
 from time import time
-from rich import print as rprint
+from dotenv import load_dotenv
+from rich.console import Console
+
+load_dotenv()
+console = Console()
 
 local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"
 
 conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
 db = Database(conninfo)
 
-collection_name = "test_pgml_sdk_1"
+collection_name = "squad_collection"
 collection = db.create_or_get_collection(collection_name)
 
 
@@ -19,7 +23,7 @@
 data = data.drop_duplicates(subset=["context"])
 
 documents = [
-    {'id': r['id'], "text": r["context"], "title": r["title"]}
+    {"id": r["id"], "text": r["context"], "title": r["title"]}
     for r in data.to_dict(orient="records")
 ]
 
@@ -28,7 +32,11 @@
 collection.generate_embeddings()
 
 start = time()
-results = collection.vector_search("Who won 20 grammy awards?", top_k=2)
-rprint("Query time %0.3f" % (time() - start))
-rprint(json.dumps(results, indent=2))
+query = "Who won 20 grammy awards?"
+results = collection.vector_search(query, top_k=5)
+_end = time()
+console.print("\nResults for '%s'" % (query), style="bold")
+console.print(results)
+console.print("Query time = %0.3f" % (_end - start))
 
 db.archive_collection(collection_name)
```
Lines changed: 55 additions & 0 deletions

```python
from pgml import Database
import os
import json
from datasets import load_dataset
from time import time
from dotenv import load_dotenv
from rich.console import Console

load_dotenv()
console = Console()

local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"

conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
db = Database(conninfo)

collection_name = "squad_collection"
collection = db.create_or_get_collection(collection_name)


data = load_dataset("squad", split="train")
data = data.to_pandas()
data = data.drop_duplicates(subset=["context"])

documents = [
    {"id": r["id"], "text": r["context"], "title": r["title"]}
    for r in data.to_dict(orient="records")
]

collection.upsert_documents(documents[:200])
collection.generate_chunks()

# register instructor model
model_id = collection.register_model(
    model_name="hkunlp/instructor-base",
    model_params={"instruction": "Represent the Wikipedia document for retrieval: "},
)
collection.generate_embeddings(model_id=model_id)

start = time()
query = "Who won 20 grammy awards?"
results = collection.vector_search(
    query,
    top_k=5,
    model_id=model_id,
    query_parameters={
        "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
    },
)
_end = time()
console.print("\nResults for '%s'" % (query), style="bold")
console.print(results)
console.print("Query time = %0.3f" % (_end - start))

db.archive_collection(collection_name)
```
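Instructor-family models embed `[instruction, text]` pairs, which is what the `instruction` parameters above are prepended as — one instruction for documents, a different one for queries. A minimal sketch of just that pairing step (no model download; the helper name is made up):

```python
def make_instructor_inputs(instruction: str, texts: list[str]) -> list[list[str]]:
    # Instructor-style embedding models consume [instruction, text] pairs;
    # documents and queries get different instructions, as in the example above.
    return [[instruction, t] for t in texts]


doc_pairs = make_instructor_inputs(
    "Represent the Wikipedia document for retrieval: ",
    ["Beyoncé has won more than 20 Grammy Awards."],
)
print(doc_pairs[0][0])
```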
Lines changed: 50 additions & 0 deletions

```python
from datasets import load_dataset
from pgml import Database
import os
from rich import print as rprint
from dotenv import load_dotenv
from time import time
from rich.console import Console

load_dotenv()
console = Console()

# Prepare Data
dataset = load_dataset("quora", split="train")
questions = []

for record in dataset["questions"]:
    questions.extend(record["text"])

# remove duplicates
documents = []
for question in list(set(questions)):
    if question:
        documents.append({"text": question})


# Get Database connection
local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"
conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
db = Database(conninfo, min_connections=4)

# Create or get collection
collection_name = "quora_collection"
collection = db.create_or_get_collection(collection_name)

# Upsert documents, chunk text, and generate embeddings
collection.upsert_documents(documents[:200])
collection.generate_chunks()
collection.generate_embeddings()

# Query vector embeddings
start = time()
query = "What is a good mobile os?"
result = collection.vector_search(query)
_end = time()

console.print("\nResults for '%s'" % (query), style="bold")
console.print(result)
console.print("Query time = %0.3f" % (_end - start))

db.archive_collection(collection_name)
```
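The Quora preprocessing above (flatten the nested records, then drop duplicates and empty strings) can be checked in isolation. A sketch with a tiny made-up dataset in the same shape:

```python
# Made-up records in the same shape as dataset["questions"] above
records = [
    {"text": ["What is AI?", "What is ML?"]},
    {"text": ["What is AI?", ""]},
]

questions = []
for record in records:
    questions.extend(record["text"])

# remove duplicates and empty strings, exactly as the example does
documents = [{"text": q} for q in set(questions) if q]
print(sorted(d["text"] for d in documents))  # → ['What is AI?', 'What is ML?']
```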
Lines changed: 56 additions & 0 deletions

```python
from pgml import Database
import os
import json
from datasets import load_dataset
from time import time
from dotenv import load_dotenv
from rich.console import Console
from rich.progress import track
from psycopg import sql
from pgml.dbutils import run_select_statement
import pandas as pd

load_dotenv()
console = Console()

local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"

conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
db = Database(conninfo)

collection_name = "ott_qa_20k_collection"
collection = db.create_or_get_collection(collection_name)


data = load_dataset("ashraq/ott-qa-20k", split="train")
documents = []

# loop through the dataset and convert tabular data to pandas dataframes
for doc in track(data):
    table = pd.DataFrame(doc["data"], columns=doc["header"])
    processed_table = "\n".join([table.to_csv(index=False)])
    documents.append(
        {
            "text": processed_table,
            "title": doc["title"],
            "url": doc["url"],
            "uid": doc["uid"],
        }
    )

collection.upsert_documents(documents)
collection.generate_chunks()

# SentenceTransformer model trained specifically for embedding tabular data for retrieval tasks
model_id = collection.register_model(model_name="deepset/all-mpnet-base-v2-table")
collection.generate_embeddings(model_id=model_id)

start = time()
query = "which country has the highest GDP in 2020?"
results = collection.vector_search(query, top_k=5, model_id=model_id)
_end = time()
console.print("\nResults for '%s'" % (query), style="bold")
console.print(results)
console.print("Query time = %0.3f" % (_end - start))

db.archive_collection(collection_name)
```
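The table-to-text conversion is the only transformation each table goes through before embedding. A sketch of that step on a made-up record in the OTT-QA shape (`header` plus `data` rows):

```python
import pandas as pd

# Made-up record in the same shape as the OTT-QA rows used above
doc = {"header": ["country", "gdp_2020"], "data": [["A", 1.0], ["B", 2.0]]}

# Same flattening as the example: one CSV string per table, header row included
table = pd.DataFrame(doc["data"], columns=doc["header"])
processed_table = table.to_csv(index=False)
print(processed_table)
```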

0 commit comments
