Mar 18, 2024 · Mar 18, 2024 · Mar 18, 2024 · Mar 19, 2024
diff --git a/pgml-sdks/pgml/python/examples/README.md b/pgml-sdks/pgml/python/examples/README.md
 # Examples

 ## Prerequisites
-Before running any examples first install dependencies and set the DATABASE_URL environment variable:
+Before running any examples first install dependencies and set the PGML_DATABASE_URL environment variable:
 ```
 pip install -r requirements.txt
-export DATABASE_URL={YOUR DATABASE URL}
+export PGML_DATABASE_URL={YOUR DATABASE URL}
 ```

 Optionally, configure a .env file containing a DATABASE_URL variable.
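
As a minimal sketch of how a script might read this setting and fail early with a readable message: the example scripts call `load_dotenv()` and then read `PGML_DATABASE_URL` from the environment. The helper name `get_database_url` below is hypothetical and not part of the SDK, and the connection string is a placeholder.

```python
import os


def get_database_url(environ=os.environ):
    """Return the connection string, or raise a readable error if it is missing.

    Hypothetical helper for illustration; it is not part of the pgml SDK.
    """
    url = environ.get("PGML_DATABASE_URL")
    if not url:
        raise RuntimeError(
            "PGML_DATABASE_URL is not set; export it or add it to a .env file"
        )
    return url


# Demonstration with an explicit mapping instead of the real environment.
example_env = {"PGML_DATABASE_URL": "postgres://user:pass@localhost:5432/db"}
print(get_database_url(example_env))
```

Passing the mapping explicitly keeps the check easy to exercise without touching the real environment.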

 ## [Summarizing Question Answering](./summarizing_question_answering.py)
 This is an example to find documents relevant to a question from the collection of documents and then summarize those documents.

 ## [Load Data](./load_data.py)
 This is a simple example to show best practices for upserting data to a collection.
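
The shape of the data matters more than how it is loaded: `upsert_documents` expects a list of dictionaries, each with at least an `id` key. The sketch below uses the standard-library `csv` module instead of pandas and shows what batching means conceptually. The SDK batches internally, so the `batches` helper is purely illustrative, and the sample rows and both function names are made up for this example. Unlike pandas, `csv.DictReader` leaves every value as a string.

```python
import csv
import io


def rows_to_documents(csv_text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def batches(items, batch_size=100):
    """Yield consecutive slices holding at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up sample data with the same columns as data/example_data.csv.
sample = (
    "id,url,text\n"
    '1,"https://example.com/a","first"\n'
    '2,"https://example.com/b","second"\n'
    '3,"https://example.com/c","third"\n'
)

documents = rows_to_documents(sample)
for batch in batches(documents, batch_size=2):
    print([doc["id"] for doc in batch])
```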

 ## [Offline Summarization](./offline_summarization.py)
 This sample shows how to perform summarization over documents that have been added to a collection using SQL.
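
The idea behind the incremental update in this example can be sketched in plain Python: store a hash of each document's text next to its summary, and recompute a summary only when the document is new or its hash has changed. This is an illustration of the technique, not the SDK's implementation. The function names are hypothetical, and hashing the UTF-8 bytes of the text is an assumption about how the stored `md5` value is derived.

```python
import hashlib


def text_version(text):
    """Return the MD5 hex digest of the text (32 characters)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def stale_ids(documents, summaries):
    """Return ids of documents that need a new summary.

    documents: {id: text}
    summaries: {id: (summary, version)}
    """
    return [
        doc_id
        for doc_id, text in documents.items()
        if doc_id not in summaries or summaries[doc_id][1] != text_version(text)
    ]


documents = {1: "Python is a programming language.", 2: "Python 3.12 is stable."}
summaries = {1: ("A language.", text_version(documents[1]))}

# Document 2 has no summary yet, so it is the only stale one.
print(stale_ids(documents, summaries))

# Editing document 1 changes its hash, so it becomes stale as well.
documents[1] = "Python is a high-level programming language."
print(stale_ids(documents, summaries))
```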
diff --git a/pgml-sdks/pgml/python/examples/data/example_data.csv b/pgml-sdks/pgml/python/examples/data/example_data.csv
 id,url,text
 1,"https://en.wikipedia.org/wiki/Python_(programming_language)","Python was conceived in the late 1980s[40] by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC programming language, which was inspired by SETL,[41] capable of exception handling and interfacing with the Amoeba operating system.[10] Its implementation began in December 1989.[42] Van Rossum shouldered sole responsibility for the project, as the lead developer, until 12 July 2018, when he announced his ""permanent vacation"" from his responsibilities as Python's ""benevolent dictator for life"", a title the Python community bestowed upon him to reflect his long-term commitment as the project's chief decision-maker.[43] In January 2019, active Python core developers elected a five-member Steering Council to lead the project."
 2,"https://en.wikipedia.org/wiki/Python_(programming_language)","Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation."
 3,"https://en.wikipedia.org/wiki/Python_(programming_language)","Python 2.0 was released on 16 October 2000, with many major new features such as list comprehensions, cycle-detecting garbage collection, reference counting, and Unicode support.[46] Python 3.0, released on 3 December 2008, with many of its major features backported to Python 2.6.x[47] and 2.7.x. Releases of Python 3 include the 2to3 utility, which automates the translation of Python 2 code to Python 3.[48]"
 4,"https://en.wikipedia.org/wiki/Python_(programming_language)","Python 2.7's end-of-life was initially set for 2015, then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3.[49][50] No further security patches or other improvements will be released for it.[51][52] Currently only 3.8 and later are supported (2023 security issues were fixed in e.g. 3.7.17, the final 3.7.x release[53]). While Python 2.7 and older is officially unsupported, a different unofficial Python implementation, PyPy, continues to support Python 2, i.e. ""2.7.18+"" (plus 3.9 and 3.10), with the plus meaning (at least some) ""backported security updates"".[54]"
 5,"https://en.wikipedia.org/wiki/Python_(programming_language)","In 2021 (and again twice in 2022), security updates were expedited, since all Python versions were insecure (including 2.7[55]) because of security issues leading to possible remote code execution[56] and web-cache poisoning.[57] In 2022, Python 3.10.4 and 3.9.12 were expedited[58] and 3.8.13, because of many security issues.[59] When Python 3.9.13 was released in May 2022, it was announced that the 3.9 series (joining the older series 3.8 and 3.7) would only receive security fixes in the future.[60] On 7 September 2022, four new releases were made due to a potential denial-of-service attack: 3.10.7, 3.9.14, 3.8.14, and 3.7.14.[61][62]"
 6,"https://en.wikipedia.org/wiki/Python_(programming_language)","As of October 2023, Python 3.12 is the stable release, and 3.12 and 3.11 are the only versions with active (as opposed to just security) support. Notable changes in 3.11 from 3.10 include increased program execution speed and improved error reporting.[63]"
 7,"https://en.wikipedia.org/wiki/Python_(programming_language)","Python 3.12 adds syntax (and in fact every Python since at least 3.5 adds some syntax) to the language, the new (soft) keyword type (recent releases have added a lot of typing support e.g. new type union operator in 3.10), and 3.11 for exception handling, and 3.10 the match and case (soft) keywords, for structural pattern matching statements. Python 3.12 also drops outdated modules and functionality, and future versions will too, see below in Development section."
 8,"https://en.wikipedia.org/wiki/Python_(programming_language)","Python 3.11 claims to be between 10 and 60% faster than Python 3.10, and Python 3.12 adds another 5% on top of that. It also has improved error messages, and many other changes."
 9,"https://en.wikipedia.org/wiki/Python_(programming_language)","Since 27 June 2023, Python 3.8 is the oldest supported version of Python (albeit in the 'security support' phase), due to Python 3.7 reaching end-of-life.[64]"
 10,"https://en.wikipedia.org/wiki/Python_(programming_language)","Python is a multi-paradigm programming language. Object-oriented programming and structured programming are fully supported, and many of their features support functional programming and aspect-oriented programming (including metaprogramming[65] and metaobjects).[66] Many other paradigms are supported via extensions, including design by contract[67][68] and logic programming.[69]"
diff --git a/pgml-sdks/pgml/python/examples/load_data.py b/pgml-sdks/pgml/python/examples/load_data.py
 import asyncio
 from pgml import Collection, Pipeline
 import pandas as pd
 from dotenv import load_dotenv

 load_dotenv()

 # Initialize Collection
 collection = Collection("load_data_demo")

 # Initialize Pipeline
 pipeline = Pipeline(
    "v1",
    {
        "text": {
            "splitter": {"model": "recursive_character"},
            "semantic_search": {"model": "intfloat/e5-small"},
        }
    },
 )


 async def init_collection():
    await collection.add_pipeline(pipeline)


 def load_documents():
    # This can be any loading function. For our case, we will be loading in a CSV
    # The important piece is that our upsert_documents wants an array of dictionaries
    data = pd.read_csv("./data/example_data.csv")
    return data.to_dict("records")


 async def main():
    # We only ever need to add a Pipeline once
    await init_collection()

    # Get our documents. Documents are just dictionaries with at least the `id` key
    # E.g. {"id": "document_one", "text": "here is some text"}
    documents = load_documents()

    # This does the actual uploading of our documents
    # It handles uploading in batches and guarantees that any documents uploaded are
    # split and embedded according to our Pipeline definition above
    await collection.upsert_documents(documents)

    # The default batch size is 100, but we can override it. With thousands or
    # millions of documents, uploading will be faster with a larger batch size:
    # await collection.upsert_documents(documents, {"batch_size": 1000})

    # Now we can search over our collection or do whatever else we want
    # See other examples for more information on searching


 asyncio.run(main())
diff --git a/pgml-sdks/pgml/python/examples/offline_summarization.py b/pgml-sdks/pgml/python/examples/offline_summarization.py
 from pgml import Collection, Pipeline
 import psycopg2
 import asyncio
 import pandas as pd
 from dotenv import load_dotenv
 import os

 load_dotenv()
 db_url = os.environ['PGML_DATABASE_URL']

 # Initialize Collection
 collection = Collection("summary_demo")

 # Initialize Pipeline
 pipeline = Pipeline(
    "v1",
    {
        "text": {
            "splitter": {"model": "recursive_character"},
            "semantic_search": {"model": "intfloat/e5-small"},
        }
    },
 )


 async def init_collection():
    await collection.add_pipeline(pipeline)


 def load_documents():
    # This can be any loading function. For our case, we will be loading in a CSV
    # The important piece is that our upsert_documents wants an array of dictionaries
    data = pd.read_csv("./data/example_data.csv")
    return data.to_dict("records")

 async def main():
    # We only ever need to add a Pipeline once
    await init_collection()

    # Get our documents. Documents are just dictionaries with at least the `id` key
    # E.g. {"id": "document_one", "text": "here is some text"}
    documents = load_documents()

    # This does the actual uploading of our documents
    # It handles uploading in batches and guarantees that any documents uploaded are
    # split and embedded according to our Pipeline definition above
    await collection.upsert_documents(documents)

    # Now that we have the documents in our database, let's do some summarization
    conn = psycopg2.connect(db_url)
    cur = conn.cursor()

    # First let's create our summary table
    # This table has four columns:
    # id - auto incrementing primary key
    # document_id - the document id we summarized
    # summary - the summary of the text of the document
    # version - the document `text` key version. This is really just
    # the hash of the `text` column. Stored in the version column of the documents table.
    # We store this column so we don't recompute summaries
    cur.execute("""
        CREATE TABLE IF NOT EXISTS summary_demo.summaries (
            id SERIAL PRIMARY KEY,
            document_id INTEGER REFERENCES summary_demo.documents (id) UNIQUE,
            summary TEXT,
            version VARCHAR(32)
        )
        """)
    conn.commit()

    # Now let's fill up our summary table
    # This query avoids redundant work: it only computes a summary for documents not
    # currently in the table, or whose `text` key has changed since the last summary
    cur.execute("""
        INSERT INTO summary_demo.summaries (document_id, summary, version)
        SELECT
            sdd.id, (
                SELECT transform[0]->>'summary_text' FROM pgml.transform(
                  task   => '{
                    "task": "summarization",
                    "model": "google/pegasus-xsum"
                  }'::JSONB,
                  inputs => ARRAY[
                    sdd.document->>'text'
                  ]
                )
            ),
            sdd.version->'text'->>'md5'
        FROM summary_demo.documents sdd
        LEFT OUTER JOIN summary_demo.summaries as sds ON sds.document_id = sdd.id
        WHERE sds.document_id IS NULL OR sds.version != sdd.version->'text'->>'md5'
        ON CONFLICT (document_id) DO UPDATE SET version = EXCLUDED.version, summary = EXCLUDED.summary
        """)
    conn.commit()

    # Let's see what our summaries are
    cur.execute("SELECT * FROM summary_demo.summaries")
    for row in cur.fetchall():
        print(row)

    # Purposefully not removing the tables and collection so they can be inspected in the database


 asyncio.run(main())
diff --git a/pgml-sdks/pgml/python/examples/requirements.txt b/pgml-sdks/pgml/python/examples/requirements.txt
 aiohttp==3.8.5
 aiosignal==1.3.1
 async-timeout==4.0.3
 psycopg2==2.9.9
 attrs==23.1.0
 certifi==2023.7.22
 charset-normalizer==3.2.0