Vectorizers#

HFTextVectorizer#

class HFTextVectorizer(model='sentence-transformers/all-mpnet-base-v2', dtype='float32', cache=None, *, dims=None)[source]#

Bases: BaseVectorizer

The HFTextVectorizer class leverages Hugging Face’s Sentence Transformers for generating vector embeddings from text input.

This vectorizer is particularly useful in scenarios where advanced natural language processing and understanding are required, and it is ideal for running on your own hardware without usage fees.

You can optionally enable caching to improve performance when generating embeddings for repeated text inputs.

Utilizing this vectorizer involves specifying a pre-trained model from Hugging Face’s vast collection of Sentence Transformers. These models are trained on a variety of datasets and tasks, ensuring versatility and robust performance across different embedding needs.

Requirements:
  • The sentence-transformers library must be installed with pip.

# Basic usage
vectorizer = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")
embedding = vectorizer.embed("Hello, world!")

# With caching enabled
from redisvl.extensions.cache.embeddings import EmbeddingsCache
cache = EmbeddingsCache(name="my_embeddings_cache")

vectorizer = HFTextVectorizer(
    model="sentence-transformers/all-mpnet-base-v2",
    cache=cache
)

# First call will compute and cache the embedding
embedding1 = vectorizer.embed("Hello, world!")

# Second call will retrieve from cache
embedding2 = vectorizer.embed("Hello, world!")

# Batch processing
embeddings = vectorizer.embed_many(
    ["Hello, world!", "How are you?"], batch_size=2
)

Initialize the Hugging Face text vectorizer.

Parameters:
  • model (str) – The pre-trained model from Hugging Face’s Sentence Transformers to be used for embedding. Defaults to ‘sentence-transformers/all-mpnet-base-v2’.

  • dtype (str) – The default datatype to use when embedding text as byte arrays. Used when setting as_buffer=True in calls to embed() and embed_many(). Defaults to ‘float32’.

  • cache (Optional[EmbeddingsCache]) – Optional EmbeddingsCache instance to cache embeddings for better performance with repeated texts. Defaults to None.

  • **kwargs – Additional parameters to pass to the SentenceTransformer constructor.

  • dims (Optional[int]) – The dimensionality of the embedding vectors; must be a positive integer if provided. Defaults to None.

Raises:
  • ImportError – If the sentence-transformers library is not installed.

  • ValueError – If there is an error setting the embedding model dimensions.

  • ValueError – If an invalid dtype is provided.
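To make the dtype parameter concrete, here is a minimal stdlib sketch, not the library’s actual implementation, of how a list of floats relates to the packed float32 byte array that as_buffer=True conceptually produces:

```python
import struct

# Sketch only: an embedding is a list of Python floats; packing it with
# dtype float32 yields 4 bytes per value.
embedding = [0.1, 0.2, 0.3]

# Pack as little-endian float32
buf = struct.pack(f"<{len(embedding)}f", *embedding)
assert len(buf) == 4 * len(embedding)  # 12 bytes for 3 values

# Unpacking recovers the values, up to float32 precision
recovered = struct.unpack(f"<{len(buf) // 4}f", buf)
```

A different dtype changes the per-value width; the exact serialization used internally by the library may differ.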

model_post_init(context, /)#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self (BaseModel) – The BaseModel instance.

  • context (Any) – The context.

Return type:

None

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property type: str#

Return the type of vectorizer.

OpenAITextVectorizer#

class OpenAITextVectorizer(model='text-embedding-ada-002', api_config=None, dtype='float32', cache=None, *, dims=None)[source]#

Bases:BaseVectorizer

The OpenAITextVectorizer class utilizes OpenAI’s API to generate embeddings for text data.

This vectorizer is designed to interact with OpenAI’s embeddings API, requiring an API key for authentication. The key can be provided directly in the api_config dictionary or through the OPENAI_API_KEY environment variable. Users must obtain an API key from OpenAI’s website (https://api.openai.com/). Additionally, the openai python client must be installed with pip install openai>=1.13.0.

The vectorizer supports both synchronous and asynchronous operations, allowing for batch processing of texts and flexibility in handling preprocessing tasks.

You can optionally enable caching to improve performance when generating embeddings for repeated text inputs.

# Basic usage with OpenAI embeddings
vectorizer = OpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={"api_key": "your_api_key"}  # OR set OPENAI_API_KEY in your env
)
embedding = vectorizer.embed("Hello, world!")

# With caching enabled
from redisvl.extensions.cache.embeddings import EmbeddingsCache
cache = EmbeddingsCache(name="openai_embeddings_cache")

vectorizer = OpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={"api_key": "your_api_key"},
    cache=cache
)

# First call will compute and cache the embedding
embedding1 = vectorizer.embed("Hello, world!")

# Second call will retrieve from cache
embedding2 = vectorizer.embed("Hello, world!")

# Asynchronous batch embedding of multiple texts
embeddings = await vectorizer.aembed_many(
    ["Hello, world!", "How are you?"], batch_size=2
)

Initialize the OpenAI vectorizer.

Parameters:
  • model (str) – Model to use for embedding. Defaults to ‘text-embedding-ada-002’.

  • api_config (Optional[Dict], optional) – Dictionary containing the API key and any additional OpenAI API options. Defaults to None.

  • dtype (str) – The default datatype to use when embedding text as byte arrays. Used when setting as_buffer=True in calls to embed() and embed_many(). Defaults to ‘float32’.

  • cache (Optional[EmbeddingsCache]) – Optional EmbeddingsCache instance to cache embeddings for better performance with repeated texts. Defaults to None.

  • dims (Optional[int]) – The dimensionality of the embedding vectors; must be a positive integer if provided. Defaults to None.

Raises:
  • ImportError – If the openai library is not installed.

  • ValueError – If the OpenAI API key is not provided.

  • ValueError – If an invalid dtype is provided.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property type: str#

Return the type of vectorizer.

AzureOpenAITextVectorizer#

class AzureOpenAITextVectorizer(model='text-embedding-ada-002', api_config=None, dtype='float32', cache=None, *, dims=None)[source]#

Bases:BaseVectorizer

The AzureOpenAITextVectorizer class utilizes AzureOpenAI’s API to generate embeddings for text data.

This vectorizer is designed to interact with AzureOpenAI’s embeddings API, requiring an API key, an AzureOpenAI deployment endpoint, and an API version. These values can be provided directly in the api_config dictionary with the parameters ‘azure_endpoint’, ‘api_version’ and ‘api_key’, or through the environment variables ‘AZURE_OPENAI_ENDPOINT’, ‘OPENAI_API_VERSION’, and ‘AZURE_OPENAI_API_KEY’. Users must obtain these values from the ‘Keys and Endpoints’ section in their Azure OpenAI service. Additionally, the openai python client must be installed with pip install openai>=1.13.0.

The vectorizer supports both synchronous and asynchronous operations, allowing for batch processing of texts and flexibility in handling preprocessing tasks.

You can optionally enable caching to improve performance when generating embeddings for repeated text inputs.

# Basic usage
vectorizer = AzureOpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={
        "api_key": "your_api_key",                # OR set AZURE_OPENAI_API_KEY in your env
        "api_version": "your_api_version",        # OR set OPENAI_API_VERSION in your env
        "azure_endpoint": "your_azure_endpoint",  # OR set AZURE_OPENAI_ENDPOINT in your env
    }
)
embedding = vectorizer.embed("Hello, world!")

# With caching enabled
from redisvl.extensions.cache.embeddings import EmbeddingsCache
cache = EmbeddingsCache(name="azureopenai_embeddings_cache")

vectorizer = AzureOpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={
        "api_key": "your_api_key",
        "api_version": "your_api_version",
        "azure_endpoint": "your_azure_endpoint",
    },
    cache=cache
)

# First call will compute and cache the embedding
embedding1 = vectorizer.embed("Hello, world!")

# Second call will retrieve from cache
embedding2 = vectorizer.embed("Hello, world!")

# Asynchronous batch embedding of multiple texts
embeddings = await vectorizer.aembed_many(
    ["Hello, world!", "How are you?"], batch_size=2
)

Initialize the AzureOpenAI vectorizer.

Parameters:
  • model (str) – Deployment to use for embedding. Must be the ‘Deployment name’, not the ‘Model name’. Defaults to ‘text-embedding-ada-002’.

  • api_config (Optional[Dict], optional) – Dictionary containing the API key, API version, Azure endpoint, and any other API options. Defaults to None.

  • dtype (str) – The default datatype to use when embedding text as byte arrays. Used when setting as_buffer=True in calls to embed() and embed_many(). Defaults to ‘float32’.

  • cache (Optional[EmbeddingsCache]) – Optional EmbeddingsCache instance to cache embeddings for better performance with repeated texts. Defaults to None.

  • dims (Optional[int]) – The dimensionality of the embedding vectors; must be a positive integer if provided. Defaults to None.

Raises:
  • ImportError – If the openai library is not installed.

  • ValueError – If the AzureOpenAI API key, version, or endpoint are not provided.

  • ValueError – If an invalid dtype is provided.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property type: str#

Return the type of vectorizer.

VertexAITextVectorizer#

class VertexAITextVectorizer(model='textembedding-gecko', api_config=None, dtype='float32', cache=None, *, dims=None)[source]#

Bases:BaseVectorizer

The VertexAITextVectorizer uses Google’s VertexAI Palm 2 embedding model API to create text embeddings.

This vectorizer is tailored for use in environments where integration with Google Cloud Platform (GCP) services is a key requirement.

Utilizing this vectorizer requires an active GCP project and location (region), along with appropriate application credentials. These can be provided through the api_config dictionary or by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable. Additionally, the vertexai python client must be installed with pip install google-cloud-aiplatform>=1.26.

You can optionally enable caching to improve performance when generating embeddings for repeated text inputs.

# Basic usage
vectorizer = VertexAITextVectorizer(
    model="textembedding-gecko",
    api_config={
        "project_id": "your_gcp_project_id",  # OR set GCP_PROJECT_ID
        "location": "your_gcp_location",      # OR set GCP_LOCATION
    }
)
embedding = vectorizer.embed("Hello, world!")

# With caching enabled
from redisvl.extensions.cache.embeddings import EmbeddingsCache
cache = EmbeddingsCache(name="vertexai_embeddings_cache")

vectorizer = VertexAITextVectorizer(
    model="textembedding-gecko",
    api_config={
        "project_id": "your_gcp_project_id",
        "location": "your_gcp_location",
    },
    cache=cache
)

# First call will compute and cache the embedding
embedding1 = vectorizer.embed("Hello, world!")

# Second call will retrieve from cache
embedding2 = vectorizer.embed("Hello, world!")

# Batch embedding of multiple texts
embeddings = vectorizer.embed_many(
    ["Hello, world!", "Goodbye, world!"], batch_size=2
)

Initialize the VertexAI vectorizer.

Parameters:
  • model (str) – Model to use for embedding. Defaults to ‘textembedding-gecko’.

  • api_config (Optional[Dict], optional) – Dictionary containing the API config details. Defaults to None.

  • dtype (str) – The default datatype to use when embedding text as byte arrays. Used when setting as_buffer=True in calls to embed() and embed_many(). Defaults to ‘float32’.

  • cache (Optional[EmbeddingsCache]) – Optional EmbeddingsCache instance to cache embeddings for better performance with repeated texts. Defaults to None.

  • dims (Optional[int]) – The dimensionality of the embedding vectors; must be a positive integer if provided. Defaults to None.

Raises:
  • ImportError – If the google-cloud-aiplatform library is not installed.

  • ValueError – If the API key is not provided.

  • ValueError – If an invalid dtype is provided.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property type: str#

Return the type of vectorizer.

CohereTextVectorizer#

class CohereTextVectorizer(model='embed-english-v3.0', api_config=None, dtype='float32', cache=None, *, dims=None)[source]#

Bases:BaseVectorizer

The CohereTextVectorizer class utilizes Cohere’s API to generate embeddings for text data.

This vectorizer is designed to interact with Cohere’s /embed API, requiring an API key for authentication. The key can be provided directly in the api_config dictionary or through the COHERE_API_KEY environment variable. Users must obtain an API key from Cohere’s website (https://dashboard.cohere.com/). Additionally, the cohere python client must be installed with pip install cohere.

The vectorizer supports only synchronous operations, allowing for batch processing of texts and flexibility in handling preprocessing tasks.

You can optionally enable caching to improve performance when generating embeddings for repeated text inputs.

from redisvl.utils.vectorize import CohereTextVectorizer

# Basic usage
vectorizer = CohereTextVectorizer(
    model="embed-english-v3.0",
    api_config={"api_key": "your-cohere-api-key"}  # OR set COHERE_API_KEY in your env
)
query_embedding = vectorizer.embed(
    text="your input query text here",
    input_type="search_query"
)
doc_embeddings = vectorizer.embed_many(
    texts=["your document text", "more document text"],
    input_type="search_document"
)

# With caching enabled
from redisvl.extensions.cache.embeddings import EmbeddingsCache
cache = EmbeddingsCache(name="cohere_embeddings_cache")

vectorizer = CohereTextVectorizer(
    model="embed-english-v3.0",
    api_config={"api_key": "your-cohere-api-key"},
    cache=cache
)

# First call will compute and cache the embedding
embedding1 = vectorizer.embed(
    text="your input query text here",
    input_type="search_query"
)

# Second call will retrieve from cache
embedding2 = vectorizer.embed(
    text="your input query text here",
    input_type="search_query"
)

Initialize the Cohere vectorizer.

Visit https://cohere.ai/embed to learn about embeddings.

Parameters:
  • model (str) – Model to use for embedding. Defaults to ‘embed-english-v3.0’.

  • api_config (Optional[Dict], optional) – Dictionary containing the API key. Defaults to None.

  • dtype (str) – The default datatype to use when embedding text as byte arrays. Used when setting as_buffer=True in calls to embed() and embed_many(). ‘float32’ will use Cohere’s float embeddings; ‘int8’ and ‘uint8’ will map to Cohere’s corresponding embedding types. Defaults to ‘float32’.

  • cache (Optional[EmbeddingsCache]) – Optional EmbeddingsCache instance to cache embeddings for better performance with repeated texts. Defaults to None.

  • dims (Optional[int]) – The dimensionality of the embedding vectors; must be a positive integer if provided. Defaults to None.

Raises:
  • ImportError – If the cohere library is not installed.

  • ValueError – If the API key is not provided.

  • ValueError – If an invalid dtype is provided.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property type: str#

Return the type of vectorizer.

BedrockTextVectorizer#

class BedrockTextVectorizer(model='amazon.titan-embed-text-v2:0', api_config=None, dtype='float32', cache=None, *, dims=None)[source]#

Bases:BaseVectorizer

The BedrockTextVectorizer class utilizes Amazon Bedrock’s API to generate embeddings for text data.

This vectorizer is designed to interact with the Amazon Bedrock API, requiring AWS credentials for authentication. The credentials can be provided directly in the api_config dictionary or through environment variables:
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION (defaults to us-east-1)

The vectorizer supports synchronous operations with batch processing andpreprocessing capabilities.

You can optionally enable caching to improve performance when generating embeddings for repeated text inputs.

# Basic usage with explicit credentials
vectorizer = BedrockTextVectorizer(
    model="amazon.titan-embed-text-v2:0",
    api_config={
        "aws_access_key_id": "your_access_key",
        "aws_secret_access_key": "your_secret_key",
        "aws_region": "us-east-1"
    }
)

# With environment variables and caching
from redisvl.extensions.cache.embeddings import EmbeddingsCache
cache = EmbeddingsCache(name="bedrock_embeddings_cache")

vectorizer = BedrockTextVectorizer(
    model="amazon.titan-embed-text-v2:0",
    cache=cache
)

# First call will compute and cache the embedding
embedding1 = vectorizer.embed("Hello, world!")

# Second call will retrieve from cache
embedding2 = vectorizer.embed("Hello, world!")

# Generate batch embeddings
embeddings = vectorizer.embed_many(["Hello", "World"], batch_size=2)

Initialize the AWS Bedrock Vectorizer.

Parameters:
  • model (str) – The Bedrock model ID to use. Defaults to ‘amazon.titan-embed-text-v2:0’.

  • api_config (Optional[Dict[str, str]]) – AWS credentials and config. Can include: aws_access_key_id, aws_secret_access_key, aws_region. If not provided, environment variables will be used.

  • dtype (str) – The default datatype to use when embedding text as byte arrays. Used when setting as_buffer=True in calls to embed() and embed_many(). Defaults to ‘float32’.

  • cache (Optional[EmbeddingsCache]) – Optional EmbeddingsCache instance to cache embeddings for better performance with repeated texts. Defaults to None.

  • dims (Optional[int]) – The dimensionality of the embedding vectors; must be a positive integer if provided. Defaults to None.

Raises:
  • ValueError – If credentials are not provided in config or environment.

  • ImportError – If boto3 is not installed.

  • ValueError – If an invalid dtype is provided.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property type: str#

Return the type of vectorizer.

CustomTextVectorizer#

class CustomTextVectorizer(embed, embed_many=None, aembed=None, aembed_many=None, dtype='float32', cache=None)[source]#

Bases:BaseVectorizer

The CustomTextVectorizer class wraps user-defined embedding methods to createembeddings for text data.

This vectorizer is designed to accept a provided callable text vectorizer and provides a class definition to allow for compatibility with RedisVL. The vectorizer may support both synchronous and asynchronous operations, which allows for batch processing of texts, but at a minimum only a synchronous callable is required to satisfy the embed() method.

You can optionally enable caching to improve performance when generating embeddings for repeated text inputs.

# Basic usage with a custom embedding function
vectorizer = CustomTextVectorizer(embed=my_vectorizer.generate_embedding)
embedding = vectorizer.embed("Hello, world!")

# With caching enabled
from redisvl.extensions.cache.embeddings import EmbeddingsCache
cache = EmbeddingsCache(name="my_embeddings_cache")

vectorizer = CustomTextVectorizer(
    embed=my_vectorizer.generate_embedding,
    cache=cache
)

# First call will compute and cache the embedding
embedding1 = vectorizer.embed("Hello, world!")

# Second call will retrieve from cache
embedding2 = vectorizer.embed("Hello, world!")

# Asynchronous batch embedding of multiple texts
embeddings = await vectorizer.aembed_many(
    ["Hello, world!", "How are you?"], batch_size=2
)

Initialize the Custom vectorizer.

Parameters:
  • embed (Callable) – A Callable function that accepts a string object and returns a list of floats.

  • embed_many (Optional[Callable]) – A Callable function that accepts a list of string objects and returns a list containing lists of floats. Defaults to None.

  • aembed (Optional[Callable]) – An asynchronous Callable function that accepts a string object and returns a list of floats. Defaults to None.

  • aembed_many (Optional[Callable]) – An asynchronous Callable function that accepts a list of string objects and returns a list containing lists of floats. Defaults to None.

  • dtype (str) – The default datatype to use when embedding text as byte arrays. Used when setting as_buffer=True in calls to embed() and embed_many(). Defaults to ‘float32’.

  • cache (Optional[EmbeddingsCache]) – Optional EmbeddingsCache instance to cache embeddings for better performance with repeated texts. Defaults to None.

Raises:
  • ValueError – If embedding validation fails.
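As an illustration of the expected signatures, here is a hypothetical pair of callables that would satisfy the embed and embed_many contracts; the toy character-count "embedding" exists purely to demonstrate the shapes involved:

```python
# Hypothetical embedding callables; the feature values are toy
# character/space counts, used only to demonstrate the signatures.
def my_embed(text: str) -> list[float]:
    # one string in, one flat list of floats out
    return [float(len(text)), float(text.count(" "))]

def my_embed_many(texts: list[str]) -> list[list[float]]:
    # list of strings in, list of float lists out
    return [my_embed(t) for t in texts]

# These could then be wired into the vectorizer:
# vectorizer = CustomTextVectorizer(embed=my_embed, embed_many=my_embed_many)
```

A real implementation would return vectors from an actual model; the only requirements are the argument and return types shown above.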

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property type: str#

Return the type of vectorizer.

VoyageAITextVectorizer#

class VoyageAITextVectorizer(model='voyage-large-2', api_config=None, dtype='float32', cache=None, *, dims=None)[source]#

Bases:BaseVectorizer

The VoyageAITextVectorizer class utilizes VoyageAI’s API to generate embeddings for text data.

This vectorizer is designed to interact with VoyageAI’s /embed API, requiring an API key for authentication. The key can be provided directly in the api_config dictionary or through the VOYAGE_API_KEY environment variable. Users must obtain an API key from VoyageAI’s website (https://dash.voyageai.com/). Additionally, the voyageai python client must be installed with pip install voyageai.

The vectorizer supports both synchronous and asynchronous operations, allowing for batch processing of texts and flexibility in handling preprocessing tasks.

You can optionally enable caching to improve performance when generating embeddings for repeated text inputs.

from redisvl.utils.vectorize import VoyageAITextVectorizer

# Basic usage
vectorizer = VoyageAITextVectorizer(
    model="voyage-large-2",
    api_config={"api_key": "your-voyageai-api-key"}  # OR set VOYAGE_API_KEY in your env
)
query_embedding = vectorizer.embed(
    text="your input query text here",
    input_type="query"
)
doc_embeddings = vectorizer.embed_many(
    texts=["your document text", "more document text"],
    input_type="document"
)

# With caching enabled
from redisvl.extensions.cache.embeddings import EmbeddingsCache
cache = EmbeddingsCache(name="voyageai_embeddings_cache")

vectorizer = VoyageAITextVectorizer(
    model="voyage-large-2",
    api_config={"api_key": "your-voyageai-api-key"},
    cache=cache
)

# First call will compute and cache the embedding
embedding1 = vectorizer.embed(
    text="your input query text here",
    input_type="query"
)

# Second call will retrieve from cache
embedding2 = vectorizer.embed(
    text="your input query text here",
    input_type="query"
)

Initialize the VoyageAI vectorizer.

Visit https://docs.voyageai.com/docs/embeddings to learn about embeddings and check the available models.

Parameters:
  • model (str) – Model to use for embedding. Defaults to “voyage-large-2”.

  • api_config (Optional[Dict], optional) – Dictionary containing the API key. Defaults to None.

  • dtype (str) – The default datatype to use when embedding text as byte arrays. Used when setting as_buffer=True in calls to embed() and embed_many(). Defaults to ‘float32’.

  • cache (Optional[EmbeddingsCache]) – Optional EmbeddingsCache instance to cache embeddings for better performance with repeated texts. Defaults to None.

  • dims (Optional[int]) – The dimensionality of the embedding vectors; must be a positive integer if provided. Defaults to None.

Raises:
  • ImportError – If the voyageai library is not installed.

  • ValueError – If the API key is not provided.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property type: str#

Return the type of vectorizer.