Custom Embeddings
LangChain is integrated with many 3rd-party embedding models. In this guide we'll show you how to create a custom Embeddings class, in case a built-in one does not already exist. Embeddings are critical in natural language processing applications because they convert text into a numerical form that algorithms can understand, enabling a wide range of applications such as similarity search, text classification, and clustering.
Implementing embeddings using the standard Embeddings interface will allow your embeddings to be utilized in existing LangChain abstractions (e.g., as the embeddings powering a VectorStore or cached using CacheBackedEmbeddings).
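For example, a custom embedding class can be wrapped by CacheBackedEmbeddings with no extra work. The following is a minimal sketch, not a definitive recipe: it assumes the langchain and langchain-core packages are installed and uses the ParrotLinkEmbeddings class implemented later in this guide.
from langchain.embeddings import CacheBackedEmbeddings
from langchain_core.stores import InMemoryByteStore

# ParrotLinkEmbeddings is the example class built in the Implementation section below
underlying_embeddings = ParrotLinkEmbeddings(model="test-model")
store = InMemoryByteStore()

cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings,
    store,
    namespace="parrot-link-test-model",  # namespace avoids cache collisions between models
)

# The first call computes and caches the vectors; repeated calls hit the cache.
cached_embeddings.embed_documents(["Hello", "world"])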
Interface
The current Embeddings abstraction in LangChain is designed to operate on text data. In this implementation, the inputs are either single strings or lists of strings, and the outputs are lists of numerical arrays (vectors), where each vector represents an embedding of the input text into some n-dimensional space.
Your custom embedding must implement the following methods:
Method/Property | Description | Required/Optional |
---|---|---|
embed_documents(texts) | Generates embeddings for a list of strings. | Required |
embed_query(text) | Generates an embedding for a single text query. | Required |
aembed_documents(texts) | Asynchronously generates embeddings for a list of strings. | Optional |
aembed_query(text) | Asynchronously generates an embedding for a single text query. | Optional |
These methods ensure that your embedding model can be integrated seamlessly into the LangChain framework, providing both synchronous and asynchronous capabilities for scalability and performance optimization.
Embeddings do not currently implement the Runnable interface and are also not instances of pydantic BaseModel.
Embedding queries vs documents
The embed_query and embed_documents methods are required. Both methods operate on string inputs; accessing Document.page_content attributes is handled by the vector store that uses the embedding model, for legacy reasons.
embed_query takes in a single string and returns a single embedding as a list of floats. If your model has different modes for embedding queries vs the underlying documents, you can implement this method to handle that.
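For example, some providers expect a different prefix for queries than for documents (or expose a separate query endpoint). Below is a hypothetical sketch of that pattern; PrefixedEmbeddings, the "query: "/"passage: " prefixes, and the _embed helper are illustrative, not a real API.
from typing import List

from langchain_core.embeddings import Embeddings


class PrefixedEmbeddings(Embeddings):
    """Hypothetical model that distinguishes query and document inputs via prefixes."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Imagine a real provider call here; documents get a "passage: " prefix.
        return [self._embed("passage: " + text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        # Queries use a different prefix so the model switches to its query mode.
        return self._embed("query: " + text)

    def _embed(self, text: str) -> List[float]:
        # Placeholder for the provider call; returns a constant vector for illustration.
        return [0.5, 0.6, 0.7]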
embed_documents takes in a list of strings and returns a list of embeddings as a list of lists of floats.
embed_documents takes in a list of plain text strings, not a list of LangChain Document objects. The name of this method may change in future versions of LangChain.
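In other words, if you are holding Document objects, pass their page_content strings rather than the documents themselves. A small sketch, using the ParrotLinkEmbeddings class implemented in the next section:
from langchain_core.documents import Document

docs = [
    Document(page_content="Document 1..."),
    Document(page_content="Document 2..."),
]

# ParrotLinkEmbeddings is the example class defined in the Implementation section below
embeddings = ParrotLinkEmbeddings(model="test-model")

# embed_documents expects raw strings, so extract page_content first
vectors = embeddings.embed_documents([doc.page_content for doc in docs])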
Implementation
As an example, we'll implement a simple embeddings model that returns a constant vector. This model is for illustrative purposes only.
from typing import List

from langchain_core.embeddings import Embeddings


class ParrotLinkEmbeddings(Embeddings):
    """ParrotLink embedding model integration.

    # TODO: Populate with relevant params.
    Key init args — completion params:
        model: str
            Name of ParrotLink model to use.

    See full list of supported init args and their descriptions in the params section.

    # TODO: Replace with relevant init params.
    Instantiate:
        .. code-block:: python

            from langchain_parrot_link import ParrotLinkEmbeddings

            embed = ParrotLinkEmbeddings(
                model="...",
                # api_key="...",
                # other params...
            )

    Embed single text:
        .. code-block:: python

            input_text = "The meaning of life is 42"
            embed.embed_query(input_text)

        .. code-block:: python

            # TODO: Example output.

    # TODO: Delete if token-level streaming isn't supported.
    Embed multiple text:
        .. code-block:: python

            input_texts = ["Document 1...", "Document 2..."]
            embed.embed_documents(input_texts)

        .. code-block:: python

            # TODO: Example output.

    # TODO: Delete if native async isn't supported.
    Async:
        .. code-block:: python

            await embed.aembed_query(input_text)

            # multiple:
            # await embed.aembed_documents(input_texts)

        .. code-block:: python

            # TODO: Example output.

    """

    def __init__(self, model: str):
        self.model = model

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed search docs."""
        return [[0.5, 0.6, 0.7] for _ in texts]

    def embed_query(self, text: str) -> List[float]:
        """Embed query text."""
        return self.embed_documents([text])[0]

    # optional: add custom async implementations here
    # you can also delete these, and the base class will
    # use the default implementation, which calls the sync
    # version in an async executor:

    # async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
    #     """Asynchronous Embed search docs."""
    #     ...

    # async def aembed_query(self, text: str) -> List[float]:
    #     """Asynchronous Embed query text."""
    #     ...
Let's test it 🧪
embeddings = ParrotLinkEmbeddings("test-model")
print(embeddings.embed_documents(["Hello", "world"]))
print(embeddings.embed_query("Hello"))
[[0.5, 0.6, 0.7], [0.5, 0.6, 0.7]]
[0.5, 0.6, 0.7]
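Because the class implements the standard Embeddings interface, it can also power a vector store directly. Here is a minimal sketch using the in-memory vector store from langchain-core; with our constant vectors, every stored text matches any query equally well.
from langchain_core.vectorstores import InMemoryVectorStore

# embeddings is the ParrotLinkEmbeddings instance created above
vector_store = InMemoryVectorStore(embedding=embeddings)
vector_store.add_texts(["Hello", "world"])

# Every text maps to the same constant vector, so similarity is uniform across documents.
print(vector_store.similarity_search("Hello", k=1))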
Contributing
We welcome contributions of Embedding models to the LangChain code base.
If you aim to contribute an embedding model for a new provider (e.g., with a new set of dependencies or SDK), we encourage you to publish your implementation in a separate langchain-* integration package. This will enable you to appropriately manage dependencies and version your package. Please refer to our contributing guide for a walkthrough of this process.