
Multimodality

Overview

Multimodality refers to the ability to work with data that comes in different forms, such as text, audio, images, and video. Multimodality can appear in various components, allowing models and systems to handle and process a mix of these data types seamlessly.

  • Chat Models: These could, in theory, accept and generate multimodal inputs and outputs, handling a variety of data types like text, images, audio, and video.
  • Embedding Models: These can represent multimodal content, embedding various forms of data—such as text, images, and audio—into vector spaces.
  • Vector Stores: Vector stores could search over embeddings that represent multimodal data, enabling retrieval across different types of information.

Multimodality in chat models


LangChain supports multimodal data as input to chat models:

  1. Following provider-specific formats
  2. Adhering to a cross-provider standard (see the how-to guides for detail)

How to use multimodal models

What kind of multimodality is supported?

Inputs

Some models can accept multimodal inputs, such as images, audio, video, or files. The types of multimodal inputs supported depend on the model provider. For instance, OpenAI, Anthropic, and Google Gemini support documents like PDFs as inputs.

The gist of passing multimodal inputs to a chat model is to use content blocks that specify a type and corresponding data. For example, to pass an image to a chat model as a URL:

from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {
            "type": "image",
            "source_type": "url",
            "url": "https://...",
        },
    ],
)
response = model.invoke([message])
API Reference: HumanMessage

We can also pass the image as in-line data:

from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {
            "type": "image",
            "source_type": "base64",
            "data": "<base64 string>",
            "mime_type": "image/jpeg",
        },
    ],
)
response = model.invoke([message])
API Reference: HumanMessage

To pass a PDF file as in-line data (or URL, as supported by providers such as Anthropic), just change "type" to "file" and "mime_type" to "application/pdf".
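
As a concrete sketch derived from the in-line image example above, a PDF content block would look like this (the prompt text and base64 payload are placeholders):

from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Summarize this document:"},
        {
            # Same shape as the base64 image block, with "file" and a PDF mime type
            "type": "file",
            "source_type": "base64",
            "data": "<base64 string>",
            "mime_type": "application/pdf",
        },
    ],
)
response = model.invoke([message])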

See the how-to guides for more detail.

Most chat models that support multimodal image inputs also accept those values in OpenAI's Chat Completions format:

from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
)
response = model.invoke([message])
API Reference: HumanMessage

Otherwise, chat models will typically accept the native, provider-specific content block format. See the chat model integrations for detail on specific providers.
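
For illustration, Anthropic's native image block nests the data under a source object; the sketch below assumes Anthropic's documented base64 format, and other providers use different field names:

from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {
            # Anthropic's native content block shape (per Anthropic's API docs)
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": "<base64 string>",
            },
        },
    ],
)
response = model.invoke([message])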

Outputs

Some chat models support multimodal outputs, such as images and audio. Multimodal outputs will appear as part of the AIMessage response object.
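
For instance, a minimal sketch of audio output with an OpenAI audio-capable model might look like the following; the model name, modalities, and audio parameters here are assumptions and vary by provider:

from langchain_openai import ChatOpenAI

# Assumed example: an OpenAI audio-capable model; parameters vary by provider.
model = ChatOpenAI(
    model="gpt-4o-audio-preview",
    model_kwargs={
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "wav"},
    },
)
response = model.invoke("Describe the weather today in one sentence.")
# The base64-encoded audio rides along on the AIMessage response object.
audio_data = response.additional_kwargs["audio"]["data"]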

Tools

Currently, no chat model is designed to work directly with multimodal data in a tool call request or ToolMessage result.

However, a chat model can easily interact with multimodal data by invoking tools with references (e.g., a URL) to the multimodal data, rather than the data itself. For example, any model capable of tool calling can be equipped with tools to download and process images, audio, or video.
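
A minimal sketch of this pattern: a hypothetical image-describing tool that accepts only a URL reference, fetching the bytes itself (httpx is used here just as an example HTTP client):

from langchain_core.tools import tool
import httpx

@tool
def describe_image(image_url: str) -> str:
    """Download the image at the given URL and return a short description of it."""
    image_bytes = httpx.get(image_url).content
    # ... hand the bytes to whatever captioning or vision pipeline you use ...
    return f"Fetched {len(image_bytes)} bytes from {image_url}"

# Any tool-calling chat model can then be bound to the tool; the model passes only
# the URL, never the raw image data, in its tool call:
# model_with_tools = model.bind_tools([describe_image])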

Multimodality in embedding models


Embeddings are vector representations of data used for tasks like similarity search and retrieval.

The current embedding interface used in LangChain is optimized entirely for text-based data, and will not work with multimodal data.
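
For reference, a sketch of that text-only interface; OpenAIEmbeddings is used purely as an illustrative provider, and the model name is an example:

from langchain_openai import OpenAIEmbeddings

# The interface maps strings to vectors; there is no method for images or audio.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

query_vector = embeddings.embed_query("What is multimodality?")
doc_vectors = embeddings.embed_documents(["First document.", "Second document."])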

As use cases involving multimodal search and retrieval tasks become more common, we expect to expand the embedding interface to accommodate other data types like images, audio, and video.

Multimodality in vector stores


Vector stores are databases for storing and retrieving embeddings, which are typically used in search and retrieval tasks. Similar to embeddings, vector stores are currently optimized for text-based data.
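
A sketch of that current, text-oriented usage; InMemoryVectorStore and OpenAIEmbeddings are illustrative choices, not requirements:

from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Text goes in and text comes back out; there is no image/audio/video path today.
store = InMemoryVectorStore(OpenAIEmbeddings())
store.add_texts([
    "LangChain chat models accept multimodal inputs.",
    "Vector stores index embedded text for retrieval.",
])
results = store.similarity_search("multimodal chat inputs", k=1)
print(results[0].page_content)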

As use cases involving multimodal search and retrieval tasks become more common, we expect to expand the vector store interface to accommodate other data types like images, audio, and video.

