
timescale/python-vector


PostgreSQL++ for AI Applications.

  • Sign up for Timescale Vector: Get 90 days free to try Timescale Vector on the Timescale cloud data platform. There is no self-managed version at this time.
  • Documentation: Learn the key features of Timescale Vector and how to use them.
  • Getting Started Tutorial: Learn how to use Timescale Vector for semantic search on a real-world dataset.
  • Learn more: Learn more about Timescale Vector, how it works, and why we built it.

If you prefer to use an LLM development or data framework, see Timescale Vector's integrations with LangChain and LlamaIndex.

Install

To install the main library, use:

pip install timescale_vector

We also use dotenv in our examples for passing around secrets and keys. You can install it with:

pip install python-dotenv

If you run into installation errors related to the psycopg2 package, you will need to install some prerequisites. The timescale-vector package explicitly depends on psycopg2 (the non-binary version). This adheres to the advice provided by psycopg2. Building psycopg2 from source requires a few prerequisites to be installed. Make sure these are installed before trying to pip install timescale_vector.
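As an illustration, on a Debian/Ubuntu-style system the usual psycopg2 build prerequisites (a C compiler, the Python development headers, and the libpq client headers) could be installed with something like the following; the package names are an assumption for that platform and will differ on other distributions:

```shell
# Assumed Debian/Ubuntu package names; adjust for your distribution.
# build-essential: C compiler and toolchain
# python3-dev:     Python header files needed to build C extensions
# libpq-dev:       PostgreSQL client library headers required by psycopg2
sudo apt-get install build-essential python3-dev libpq-dev
```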

Basic usage

First, import all the necessary libraries:

from dotenv import load_dotenv, find_dotenv
import os
from timescale_vector import client
import uuid
from datetime import datetime, timedelta

Load up your PostgreSQL credentials. The safest way is with a .env file:

_ = load_dotenv(find_dotenv(), override=True)
service_url = os.environ['TIMESCALE_SERVICE_URL']

Next, create the client. In this tutorial, we will use the sync client. But we have an async client as well (with an identical interface that uses async functions).

The client constructor takes three required arguments:

service_url: Timescale service URL / connection string
table_name: Name of the table to use for storing the embeddings. Think of this as the collection name.
num_dimensions: Number of dimensions in the vector

You can also specify the schema name, distance type, primary key type, etc. as optional parameters. Please see the documentation for details.

vec = client.Sync(service_url, "my_data", 2)

Next, create the tables for the collection:

vec.create_tables()

Next, insert some data. The data record contains:

  • A UUID to uniquely identify the embedding
  • A JSON blob of metadata about the embedding
  • The text the embedding represents
  • The embedding itself

Because this data includes UUIDs which become primary keys, we ingest with upserts.

vec.upsert([
    (uuid.uuid1(), {"animal": "fox"}, "the brown fox", [1.0, 1.3]),
    (uuid.uuid1(), {"animal": "fox", "action": "jump"}, "jumped over the", [1.0, 10.8]),
])

You can now create a vector index to speed up similarity search:

vec.create_embedding_index(client.DiskAnnIndex())

Now, you can query for similar items:

vec.search([1.0, 9.0])
[[UUID('4494c186-4a0d-11ef-94a3-6ee10b77fd09'),  {'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456], [UUID('4494c12c-4a0d-11ef-94a3-6ee10b77fd09'),  {'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]

There are many search options, which we will cover below in the Advanced search section.

As one example, we will return one item using a similarity search constrained by a metadata filter.

vec.search([1.0, 9.0], limit=1, filter={"action": "jump"})
[[UUID('4494c186-4a0d-11ef-94a3-6ee10b77fd09'),  {'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456]]

The returned records contain 5 fields:

id: The UUID of the record
metadata: The JSON metadata associated with the record
contents: The text content that was embedded
embedding: The vector embedding
distance: The distance between the query embedding and the vector

You can access the fields by simply using the record as a dictionary keyed on the field name:

records = vec.search([1.0, 9.0], limit=1, filter={"action": "jump"})
(records[0]["id"], records[0]["metadata"], records[0]["contents"], records[0]["embedding"], records[0]["distance"])
(UUID('4494c186-4a0d-11ef-94a3-6ee10b77fd09'), {'action': 'jump', 'animal': 'fox'}, 'jumped over the', array([ 1. , 10.8], dtype=float32), 0.00016793422934946456)

You can delete by ID:

vec.delete_by_ids([records[0]["id"]])

Or you can delete by metadata filters:

vec.delete_by_metadata({"action":"jump"})

To delete all records, use:

vec.delete_all()

Advanced usage

In this section, we will go into more detail about these features. We will cover:

  1. Search filter options - how to narrow your search with additional constraints
  2. Indexing - how to speed up your similarity queries
  3. Time-based partitioning - how to optimize similarity queries that filter on time
  4. Setting different distance types to use in distance calculations

Search options

The search function is very versatile and allows you to search for the right vectors in a wide variety of ways. We'll describe the search options in 3 parts:

  1. We'll cover basic similarity search.
  2. Then, we'll describe how to filter your search based on the associated metadata.
  3. Finally, we'll talk about filtering on time when time-partitioning is enabled.

Let’s use the following data for our example:

vec.upsert([
    (uuid.uuid1(), {"animal": "fox", "action": "sit", "times": 1}, "the brown fox", [1.0, 1.3]),
    (uuid.uuid1(), {"animal": "fox", "action": "jump", "times": 100}, "jumped over the", [1.0, 10.8]),
])

The basic query looks like:

vec.search([1.0, 9.0])
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456], [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 1, 'action': 'sit', 'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]

You could provide a limit for the number of items returned:

vec.search([1.0, 9.0], limit=1)
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456]]

Narrowing your search by metadata

We have two main ways to filter results by metadata:

  • filters for equality matches on metadata.
  • predicates for complex conditions on metadata.

Filters are more likely to be performant but are more limited in whatthey can express, so we suggest using those if your use case allows it.

Filters

You could specify a match on the metadata as a dictionary where all keys have to match the provided values (keys not in the filter are unconstrained):

vec.search([1.0, 9.0], limit=1, filter={"action": "sit"})
[[UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 1, 'action': 'sit', 'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]

You can also specify a list of filter dictionaries, where an item is returned if it matches any dict:

vec.search([1.0, 9.0], limit=2, filter=[{"action": "jump"}, {"animal": "fox"}])
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456], [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 1, 'action': 'sit', 'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]
Predicates

Predicates allow for more complex search conditions. For example, you could use greater-than and less-than conditions on numeric values.

vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("times", ">", 1))
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456]]

Predicates objects are defined by the name of the metadata key, an operator, and a value.

The supported operators are: ==, !=, <, <=, >, >=

The type of the value determines the type of comparison to perform. For example, passing in "Sam" (a string) will do a string comparison, 10 (an int) will perform an integer comparison, and 10.0 (a float) will do a float comparison. It is important to note that a value of "10" will do a string comparison as well, so it's important to use the right type. Supported Python types are: str, int, and float. One more example with a string comparison:

vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("action", "==", "jump"))
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456]]
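The type sensitivity described above mirrors plain Python comparison semantics, which can be seen without the library at all. In this standalone sketch, compare is a hypothetical stand-in, not part of the client API:

```python
import operator

# Map the predicate operators onto Python's own comparison functions.
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def compare(value, op, target):
    """Toy predicate check: the comparison performed depends on the operand types."""
    return OPS[op](value, target)

print(compare(10, ">", 9))      # True: integer comparison
print(compare("10", ">", "9"))  # False: lexicographic string comparison
```

This is why passing "10" instead of 10 can silently change which rows match.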

The real power of predicates is that they can also be combined using the & operator (for combining predicates with AND semantics) and the | operator (for combining with OR semantics). So you can do:

vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("action", "==", "jump") & client.Predicates("times", ">", 1))
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456]]

Just for sanity, let's show a case where no results are returned because of the predicates:

vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("action", "==", "jump") & client.Predicates("times", "==", 1))
[]

And one more example where we define the predicates as a variable and use grouping with parentheses:

my_predicates = client.Predicates("action", "==", "jump") & (client.Predicates("times", "==", 1) | client.Predicates("times", ">", 1))
vec.search([1.0, 9.0], limit=2, predicates=my_predicates)
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456]]

We also have some syntactic sugar for combining many predicates with AND semantics. You can pass in multiple 3-tuples to Predicates:

vec.search([1.0, 9.0], limit=2, predicates=client.Predicates(("action", "==", "jump"), ("times", ">", 10)))
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456]]

Filter your search by time

When using time-partitioning (see below), you can very efficiently filter your search by time. Time-partitioning embeds a timestamp in the UUID-based ID associated with an embedding. Let us first create a collection with time partitioning and insert some data (one item from January 2018 and another from January 2019):

tpvec = client.Sync(service_url, "time_partitioned_table", 2, time_partition_interval=timedelta(hours=6))
tpvec.create_tables()
specific_datetime = datetime(2018, 1, 1, 12, 0, 0)
tpvec.upsert([
    (client.uuid_from_time(specific_datetime), {"animal": "fox", "action": "sit", "times": 1}, "the brown fox", [1.0, 1.3]),
    (client.uuid_from_time(specific_datetime + timedelta(days=365)), {"animal": "fox", "action": "jump", "times": 100}, "jumped over the", [1.0, 10.8]),
])

Then, you can filter using the timestamps by specifying a uuid_time_filter:

tpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(specific_datetime, specific_datetime + timedelta(days=1)))
[[UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),  {'times': 1, 'action': 'sit', 'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]

A UUIDTimeRange can specify a start_date or end_date or both (as in the example above). Specifying only the start_date or end_date leaves the other end unconstrained.

tpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(start_date=specific_datetime))
[[UUID('ac8be800-0de6-11e9-a5fd-5a100e653c25'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456], [UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),  {'times': 1, 'action': 'sit', 'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]

You have the option to define the inclusivity of the start and end dates with the start_inclusive and end_inclusive parameters. Setting start_inclusive to true results in comparisons using the >= operator, whereas setting it to false applies the > operator. By default, the start date is inclusive, while the end date is exclusive. One example:

tpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(start_date=specific_datetime, start_inclusive=False))
[[UUID('ac8be800-0de6-11e9-a5fd-5a100e653c25'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456]]

Notice how the results are different when we use the start_inclusive=False option, because the first row has the exact timestamp specified by start_date.

We’ve also made it easy to integrate time filters using the filter and predicates parameters described above, using special reserved key names to make it appear that the timestamps are part of your metadata. We found this useful when integrating with other systems that just want to specify a set of filters (often these are “auto retriever” type systems). The reserved key names are __start_date and __end_date for filters and __uuid_timestamp for predicates. Some examples below:

tpvec.search([1.0, 9.0], limit=4, filter={"__start_date": specific_datetime, "__end_date": specific_datetime + timedelta(days=1)})
[[UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),  {'times': 1, 'action': 'sit', 'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]
tpvec.search([1.0, 9.0], limit=4, predicates=client.Predicates("__uuid_timestamp", ">=", specific_datetime) & client.Predicates("__uuid_timestamp", "<", specific_datetime + timedelta(days=1)))
[[UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),  {'times': 1, 'action': 'sit', 'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]

Indexing

Indexing speeds up queries over your data. By default, we set up indexes to query your data by the UUID and the metadata.

But to speed up similarity search based on the embeddings, you have to create additional indexes.

Note that if performing a query without an index, you will always get an exact result, but the query will be slow (it has to read all of the data you store for every query). With an index, your queries will be orders-of-magnitude faster, but the results are approximate (because there are no known indexing techniques that are both fast and exact).

Nevertheless, there are excellent approximate algorithms. There are 3 different indexing algorithms available on the Timescale platform: the Timescale Vector index, pgvector HNSW, and pgvector ivfflat. Below are the trade-offs between these algorithms:

Algorithm          Build speed   Query speed   Rebuild after updates?
StreamingDiskANN   Fast          Fastest       No
pgvector hnsw      Slowest       Faster        No
pgvector ivfflat   Fastest       Slowest       Yes

You can see benchmarks on our blog.

We recommend using the Timescale Vector index for most use cases. This can be created with:

vec.create_embedding_index(client.DiskAnnIndex())

Indexes are created for a particular distance metric type. So it is important that the same distance metric is set on the client during index creation as during queries. See the distance type section below.

Each of these indexes has a set of build-time options for controlling the speed/accuracy trade-off when creating the index, and an additional query-time option for controlling accuracy during a particular query. We have smart defaults for all of these options but will also describe the details below so that you can adjust them manually.

StreamingDiskANN index

The StreamingDiskANN index from pgvectorscale is a graph-based algorithm that uses the DiskANN algorithm. You can read more about it on our blog post announcing its release.

To create this index, run:

vec.create_embedding_index(client.DiskAnnIndex())

The above command will create the index using smart defaults. There are a number of parameters you could tune to adjust the accuracy/speed trade-off.

The parameters you can set at index build time are:

storage_layout: memory_optimized, which uses SBQ to compress vector data, or plain, which stores data uncompressed (default: memory_optimized)
num_neighbors: Sets the maximum number of neighbors per node. Higher values increase accuracy but make the graph traversal slower (default: 50)
search_list_size: The S parameter used in the greedy search algorithm during construction. Higher values improve graph quality at the cost of slower index builds (default: 100)
max_alpha: The alpha parameter in the algorithm. Higher values improve graph quality at the cost of slower index builds (default: 1.2)
num_dimensions: The number of dimensions to index. By default, all dimensions are indexed, but you can also index fewer dimensions to make use of Matryoshka embeddings (default: 0, meaning all dimensions)
num_bits_per_dimension: Number of bits used to encode each dimension when using SBQ (default: 2 for fewer than 900 dimensions, 1 otherwise)

To set these parameters, you could run:

vec.create_embedding_index(client.DiskAnnIndex(num_neighbors=50, search_list_size=100, max_alpha=1.0, storage_layout="memory_optimized", num_dimensions=0, num_bits_per_dimension=1))

You can also set a parameter to control the accuracy vs. query speed trade-off at query time. The parameter is set in the search() function using the query_params argument.

search_list_size: The number of additional candidates considered during the graph search (default: 100)
rescore: The number of elements rescored; set 0 to disable rescoring (default: 50)

We suggest using the rescore parameter to fine-tune accuracy.

vec.search([1.0, 9.0], limit=4, query_params=client.DiskAnnIndexParams(rescore=400, search_list_size=10))
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456], [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 1, 'action': 'sit', 'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]

To drop the index, run:

vec.drop_embedding_index()

pgvector HNSW index

Pgvector provides a graph-based indexing algorithm based on the popular HNSW algorithm.

To create this index, run:

vec.create_embedding_index(client.HNSWIndex())

The above command will create the index using smart defaults. There are a number of parameters you could tune to adjust the accuracy/speed trade-off.

The parameters you can set at index build time are:

m: The maximum number of connections per layer. Think of these connections as edges created for each node during graph construction. Increasing m increases accuracy but also increases index build time and size (default: 16)
ef_construction: The size of the dynamic candidate list for constructing the graph. It influences the trade-off between index quality and construction speed. Increasing ef_construction enables more accurate search results at the expense of lengthier index build times (default: 64)

To set these parameters, you could run:

vec.create_embedding_index(client.HNSWIndex(m=16, ef_construction=64))

You can also set a parameter to control the accuracy vs. query speed trade-off at query time. The parameter is set in the search() function using the query_params argument. You can set ef_search (default: 40). This parameter specifies the size of the dynamic candidate list used during search. Higher values improve query accuracy while making the query slower.

You can specify this value during search as follows:

vec.search([1.0, 9.0], limit=4, query_params=client.HNSWIndexParams(ef_search=10))
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456], [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 1, 'action': 'sit', 'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]

To drop the index, run:

vec.drop_embedding_index()

pgvector ivfflat index

Pgvector provides a clustering-based indexing algorithm. Our blog post describes how it works in detail. It provides the fastest index-build speed but the slowest query speed of any indexing algorithm.

To create this index, run:

vec.create_embedding_index(client.IvfflatIndex())

Note: ivfflat should never be created on empty tables because it needs to cluster data, and that only happens when an index is first created, not when new rows are inserted or modified. Also, if your table undergoes a lot of modifications, you will need to rebuild this index occasionally to maintain good accuracy. See our blog post for details.

Pgvector ivfflat has a lists index parameter that is automatically set with a smart default based on the number of rows in your table. If you know that you'll have a different table size, you can specify the number of records to use for calculating the lists parameter as follows:

vec.create_embedding_index(client.IvfflatIndex(num_records=1000000))

You can also set the lists parameter directly:

vec.create_embedding_index(client.IvfflatIndex(num_lists=100))

You can also set a parameter to control the accuracy vs. query speed trade-off at query time. The parameter is set in the search() function using the query_params argument. You can set probes. This parameter specifies the number of clusters searched during a query. It is recommended to set this parameter to sqrt(lists), where lists is the num_lists parameter used above during index creation. Higher values improve query accuracy while making the query slower.

You can specify this value during search as follows:

vec.search([1.0, 9.0], limit=4, query_params=client.IvfflatIndexParams(probes=10))
[[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 100, 'action': 'jump', 'animal': 'fox'},  'jumped over the',  array([ 1. , 10.8], dtype=float32),  0.00016793422934946456], [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),  {'times': 1, 'action': 'sit', 'animal': 'fox'},  'the brown fox',  array([1. , 1.3], dtype=float32),  0.14489260377438218]]
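The sqrt(lists) rule of thumb above can be computed directly; the values here are just for illustration:

```python
import math

num_lists = 100  # the lists value used at index creation (illustrative)
probes = round(math.sqrt(num_lists))  # recommended starting point for probes
print(probes)  # 10
```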

To drop the index, run:

vec.drop_embedding_index()

Time partitioning

In many use cases where you have many embeddings, time is an important component associated with the embeddings. For example, when embedding news stories, you often search by time as well as similarity (e.g., stories related to Bitcoin in the past week or stories about Clinton in November 2016).

Yet, traditionally, searching by the two components "similarity" and "time" is challenging for Approximate Nearest Neighbor (ANN) indexes and makes the similarity-search index less effective.

One approach to solving this is partitioning the data by time and creating ANN indexes on each partition individually. Then, during search, you can:

  • Step 1: filter out partitions that don't match the time predicate.
  • Step 2: perform the similarity search on all matching partitions.
  • Step 3: combine all the results from each partition in step 2, rerank, and filter out results by time.

Step 1 makes the search a lot more efficient by filtering out whole swaths of data in one go.
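The three steps can be sketched in plain Python. This is a toy, in-memory model of partition pruning, not the library's implementation; the partition layout and distance function are illustrative:

```python
from datetime import datetime, timedelta

# Toy partitions keyed by their start time; each holds (id, timestamp, embedding) rows.
PARTITION_WIDTH = timedelta(days=365)
partitions = {
    datetime(2018, 1, 1): [("a", datetime(2018, 1, 1, 3), [1.0, 1.3])],
    datetime(2019, 1, 1): [("b", datetime(2019, 1, 1, 3), [1.0, 10.8])],
}

def search(query, start, end, limit=10):
    # Step 1: prune partitions that cannot overlap [start, end).
    live = [rows for p_start, rows in partitions.items()
            if p_start < end and p_start + PARTITION_WIDTH > start]
    # Step 2: similarity search (exact squared L2 here) within surviving partitions.
    candidates = [(rid, ts, sum((q - e) ** 2 for q, e in zip(query, emb)))
                  for rows in live for rid, ts, emb in rows]
    # Step 3: apply the exact time filter, then merge and rerank by distance.
    hits = [(rid, dist) for rid, ts, dist in candidates if start <= ts < end]
    return sorted(hits, key=lambda h: h[1])[:limit]

print(search([1.0, 9.0], datetime(2018, 1, 1), datetime(2018, 2, 1)))
# only the 2018 row survives the time filter
```

The coarse pruning in step 1 means whole partitions are skipped without scoring a single vector in them.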

Timescale-vector supports time partitioning using TimescaleDB's hypertables. To use this feature, simply indicate the length of time for each partition when creating the client:

from datetime import datetime, timedelta

vec = client.Async(service_url, "my_data_with_time_partition", 2, time_partition_interval=timedelta(hours=6))
await vec.create_tables()

Then, insert data where the IDs use version-1 UUIDs, and the time component of the UUID specifies the time of the embedding. For example, to create an embedding for the current time, simply do:

id = uuid.uuid1()
await vec.upsert([(id, {"key": "val"}, "the brown fox", [1.0, 1.2])])
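As a standard-library illustration of how a version-1 UUID carries its creation time (this is general UUID v1 behavior, independent of this library):

```python
import uuid
from datetime import datetime, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns ticks.
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000

def uuid1_to_datetime(u: uuid.UUID) -> datetime:
    """Recover the timestamp embedded in a version-1 UUID."""
    assert u.version == 1, "only v1 UUIDs carry a timestamp"
    unix_100ns = u.time - GREGORIAN_TO_UNIX_100NS
    return datetime.fromtimestamp(unix_100ns / 1e7, tz=timezone.utc)

u = uuid.uuid1()
print(uuid1_to_datetime(u))  # approximately the current time, in UTC
```

This embedded timestamp is what lets the partitioning scheme route each row to the right time partition using only its ID.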

To insert data for a specific time in the past, create the UUID using our uuid_from_time function:

specific_datetime = datetime(2018, 8, 10, 15, 30, 0)
await vec.upsert([(client.uuid_from_time(specific_datetime), {"key": "val"}, "the brown fox", [1.0, 1.2])])

You can then query the data by specifying a uuid_time_filter in the search call:

rec = await vec.search([1.0, 2.0], limit=4, uuid_time_filter=client.UUIDTimeRange(specific_datetime - timedelta(days=7), specific_datetime + timedelta(days=7)))

Distance metrics

By default, we use cosine distance to measure how similar an embedding is to a given query. In addition to cosine distance, we also support Euclidean/L2 distance. The distance type is set when creating the client using the distance_type parameter. For example, to use the Euclidean distance metric, you can create the client with:

vec = client.Sync(service_url, "my_data", 2, distance_type="euclidean")

Valid values for distance_type are cosine and euclidean.
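For intuition, the two metrics can be computed directly in plain Python (a sketch of the math, not the library's implementation):

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))      # 1.0: orthogonal vectors
print(euclidean_distance([1.0, 1.3], [1.0, 10.8]))  # 9.5
```

Note that cosine distance ignores vector magnitude (only direction matters), while Euclidean distance does not, which is why the two metrics can rank neighbors differently.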

It is important to note that you should use consistent distance types on clients that create indexes and perform queries. That is because an index is only valid for one particular type of distance measure.

Please note that the Timescale Vector index only supports cosine distance at this time.

LangChain integration

LangChain is a popular framework for developing applications powered by LLMs. Timescale Vector has a native LangChain integration, enabling you to use Timescale Vector as a vector store and leverage all its capabilities in your applications built with LangChain.

Here are resources about using Timescale Vector with LangChain:

LlamaIndex integration

LlamaIndex is a popular data framework for connecting custom data sources to large language models (LLMs). Timescale Vector has a native LlamaIndex integration, enabling you to use Timescale Vector as a vector store and leverage all its capabilities in your applications built with LlamaIndex.

Here are resources about using Timescale Vector with LlamaIndex:

PgVectorize

PgVectorize enables you to create vector embeddings from any data that you already have stored in PostgreSQL. You can get more background information in our blog post announcing this feature, as well as a "how we built it" post going into the details of the design.

To create vector embeddings, simply attach PgVectorize to any PostgreSQL table, and it will automatically sync that table's data with a set of embeddings stored in Timescale Vector. For example, let's say you have a blog table defined in the following way:

import psycopg2
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from timescale_vector import client, pgvectorizer
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.timescalevector import TimescaleVector
from datetime import timedelta
with psycopg2.connect(service_url) as conn:
    with conn.cursor() as cursor:
        cursor.execute('''
        CREATE TABLE IF NOT EXISTS blog (
            id              SERIAL PRIMARY KEY NOT NULL,
            title           TEXT NOT NULL,
            author          TEXT NOT NULL,
            contents        TEXT NOT NULL,
            category        TEXT NOT NULL,
            published_time  TIMESTAMPTZ NULL --NULL if not yet published
        );
        ''')

You can insert some data as follows:

with psycopg2.connect(service_url) as conn:
    with conn.cursor() as cursor:
        cursor.execute('''
            INSERT INTO blog (title, author, contents, category, published_time)
            VALUES ('First Post', 'Matvey Arye', 'some super interesting content about cats.', 'AI', '2021-01-01');
        ''')

Now, say you want to embed these blogs in Timescale Vector. First, you need to define an embed_and_write function that takes a set of blog posts, creates the embeddings, and writes them into Timescale Vector. For example, if using LangChain, it could look something like the following.

def get_document(blog):
    text_splitter = CharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    docs = []
    for chunk in text_splitter.split_text(blog['contents']):
        content = f"Author {blog['author']}, title: {blog['title']}, contents:{chunk}"
        metadata = {
            "id": str(client.uuid_from_time(blog['published_time'])),
            "blog_id": blog['id'],
            "author": blog['author'],
            "category": blog['category'],
            "published_time": blog['published_time'].isoformat(),
        }
        docs.append(Document(page_content=content, metadata=metadata))
    return docs

def embed_and_write(blog_instances, vectorizer):
    embedding = OpenAIEmbeddings()
    vector_store = TimescaleVector(
        collection_name="blog_embedding",
        service_url=service_url,
        embedding=embedding,
        time_partition_interval=timedelta(days=30),
    )
    # Delete old embeddings for all ids in the work queue. locked_id is a special
    # column that is set to the primary key of the table being embedded. For items
    # that are deleted, it is the only key that is set.
    metadata_for_delete = [{"blog_id": blog['locked_id']} for blog in blog_instances]
    vector_store.delete_by_metadata(metadata_for_delete)

    documents = []
    for blog in blog_instances:
        # Skip blogs that are not published yet, or are deleted
        # (in which case published_time will be NULL).
        if blog['published_time'] is not None:
            documents.extend(get_document(blog))
    if len(documents) == 0:
        return

    texts = [d.page_content for d in documents]
    metadatas = [d.metadata for d in documents]
    ids = [d.metadata["id"] for d in documents]
    vector_store.add_texts(texts, metadatas, ids)

Then, all you have to do is run the following code in a scheduled job (cron job, Lambda job, etc.):

# this job should be run on a schedule
vectorizer = pgvectorizer.Vectorize(service_url, 'blog')
while vectorizer.process(embed_and_write) > 0:
    pass

Every time that job runs, it will sync the table with your embeddings. It will sync all inserts, updates, and deletes to an embeddings table called blog_embedding.

Now, you can simply search the embeddings as follows (again, using LangChain in the example):

embedding = OpenAIEmbeddings()
vector_store = TimescaleVector(
    collection_name="blog_embedding",
    service_url=service_url,
    embedding=embedding,
    time_partition_interval=timedelta(days=30),
)
res = vector_store.similarity_search_with_score("Blogs about cats")
res
[(Document(metadata={'id': '334e4800-4bee-11eb-a52a-57b3c4a96ccb', 'author': 'Matvey Arye', 'blog_id': 1, 'category': 'AI', 'published_time': '2021-01-01T00:00:00-05:00'}, page_content='Author Matvey Arye, title: First Post, contents:some super interesting content about cats.'),  0.12680577303752072)]

Development

This project is developed with nbdev. Please see that website for the development process.

