How to trim messages
This guide assumes familiarity with the following concepts: messages, chat models, chaining, and chat history.
The methods in this guide also require langchain-core>=0.2.9.
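If needed, you can install or upgrade it with:
pip install -qU "langchain-core>=0.2.9"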
All models have finite context windows, meaning there's a limit to how many tokens they can take as input. If you have very long messages or a chain/agent that accumulates a long message history, you'll need to manage the length of the messages you're passing to the model.
trim_messages can be used to reduce the size of a chat history to a specified token count or specified message count.
If passing the trimmed chat history back into a chat model directly, the trimmed chat history should satisfy the following properties:
- The resulting chat history should be valid. Usually this means that the following properties should be satisfied:
  - The chat history starts with either (1) a HumanMessage or (2) a SystemMessage followed by a HumanMessage.
  - The chat history ends with either a HumanMessage or a ToolMessage.
  - A ToolMessage can only appear after an AIMessage that involved a tool call.
  This can be achieved by setting start_on="human" and end_on=("human", "tool").
- It includes recent messages and drops old messages in the chat history. This can be achieved by setting strategy="last".
- Usually, the new chat history should include the SystemMessage if it was present in the original chat history, since the SystemMessage includes special instructions to the chat model. The SystemMessage is almost always the first message in the history if present. This can be achieved by setting include_system=True.
Trimming based on token count
Here, we'll trim the chat history based on token count. The trimmed chat history will produce a valid chat history that includes the SystemMessage.
To keep the most recent messages, we set strategy="last". We'll also set include_system=True to include the SystemMessage, and start_on="human" to make sure the resulting chat history is valid.
This is a good default configuration when using trim_messages based on token count. Remember to adjust token_counter and max_tokens for your use case.
Notice that for our token_counter we can pass in a function (more on that below) or a language model (since language models have a message token counting method). It makes sense to pass in a model when you're trimming your messages to fit into the context window of that specific model:
pip install -qU langchain-openai
from langchain_core.messages import (
AIMessage,
HumanMessage,
SystemMessage,
ToolMessage,
trim_messages,
)
from langchain_core.messages.utils import count_tokens_approximately
messages = [
SystemMessage("you're a good assistant, you always respond with a joke."),
HumanMessage("i wonder why it's called langchain"),
AIMessage(
'Well, I guess they thought "WordRope" and "SentenceString" just didn\'t have the same ring to it!'
),
HumanMessage("and who is harrison chasing anyways"),
AIMessage(
"Hmmm let me think.\n\nWhy, he's probably chasing after the last cup of coffee in the office!"
),
HumanMessage("what do you call a speechless parrot"),
]
trim_messages(
messages,
# Keep the last <= max_tokens tokens of the messages.
strategy="last",
# Remember to adjust based on your model
# or else pass a custom token_counter
token_counter=count_tokens_approximately,
# Remember to adjust based on the desired conversation
# length
max_tokens=45,
# Most chat models expect that chat history starts with either:
# (1) a HumanMessage or
# (2) a SystemMessage followed by a HumanMessage
start_on="human",
# Most chat models expect that chat history ends with either:
# (1) a HumanMessage or
# (2) a ToolMessage
end_on=("human","tool"),
# Usually, we want to keep the SystemMessage
# if it's present in the original history.
# The SystemMessage has special instructions for the model.
include_system=True,
allow_partial=False,
)
[SystemMessage(content="you're a good assistant, you always respond with a joke.", additional_kwargs={}, response_metadata={}),
HumanMessage(content='what do you call a speechless parrot', additional_kwargs={}, response_metadata={})]
Trimming based on message count
Alternatively, we can trim the chat history based on message count, by setting token_counter=len. In this case, each message will count as a single token, and max_tokens will control the maximum number of messages.
This is a good default configuration when using trim_messages based on message count. Remember to adjust max_tokens for your use case.
trim_messages(
messages,
# Keep the last <= max_tokens tokens of the messages.
strategy="last",
token_counter=len,
# When token_counter=len, each message
# will be counted as a single token.
# Remember to adjust for your use case
max_tokens=5,
# Most chat models expect that chat history starts with either:
# (1) a HumanMessage or
# (2) a SystemMessage followed by a HumanMessage
start_on="human",
# Most chat models expect that chat history ends with either:
# (1) a HumanMessage or
# (2) a ToolMessage
end_on=("human","tool"),
# Usually, we want to keep the SystemMessage
# if it's present in the original history.
# The SystemMessage has special instructions for the model.
include_system=True,
)
[SystemMessage(content="you're a good assistant, you always respond with a joke.", additional_kwargs={}, response_metadata={}),
HumanMessage(content='and who is harrison chasing anyways', additional_kwargs={}, response_metadata={}),
AIMessage(content="Hmmm let me think.\n\nWhy, he's probably chasing after the last cup of coffee in the office!", additional_kwargs={}, response_metadata={}),
HumanMessage(content='what do you call a speechless parrot', additional_kwargs={}, response_metadata={})]
Advanced usage
You can use trim_messages as a building block to create more complex processing logic.
If we want to allow splitting up the contents of a message, we can specify allow_partial=True:
trim_messages(
messages,
max_tokens=56,
strategy="last",
token_counter=count_tokens_approximately,
include_system=True,
allow_partial=True,
)
[SystemMessage(content="you're a good assistant, you always respond with a joke.", additional_kwargs={}, response_metadata={}),
AIMessage(content="\nWhy, he's probably chasing after the last cup of coffee in the office!", additional_kwargs={}, response_metadata={}),
HumanMessage(content='what do you call a speechless parrot', additional_kwargs={}, response_metadata={})]
By default, the SystemMessage will not be included, so you can drop it by either setting include_system=False or by dropping the include_system argument.
trim_messages(
messages,
max_tokens=45,
strategy="last",
token_counter=count_tokens_approximately,
)
[AIMessage(content="Hmmm let me think.\n\nWhy, he's probably chasing after the last cup of coffee in the office!", additional_kwargs={}, response_metadata={}),
HumanMessage(content='what do you call a speechless parrot', additional_kwargs={}, response_metadata={})]
We can perform the flipped operation of getting the first max_tokens by specifying strategy="first":
trim_messages(
messages,
max_tokens=45,
strategy="first",
token_counter=count_tokens_approximately,
)
[SystemMessage(content="you're a good assistant, you always respond with a joke.", additional_kwargs={}, response_metadata={}),
HumanMessage(content="i wonder why it's called langchain", additional_kwargs={}, response_metadata={})]
Using ChatModel as a token counter
You can pass a ChatModel as a token counter. This will use ChatModel.get_num_tokens_from_messages. Let's demonstrate how to use it with OpenAI:
from langchain_openai import ChatOpenAI
trim_messages(
messages,
max_tokens=45,
strategy="first",
token_counter=ChatOpenAI(model="gpt-4o"),
)
[SystemMessage(content="you're a good assistant, you always respond with a joke.", additional_kwargs={}, response_metadata={}),
HumanMessage(content="i wonder why it's called langchain", additional_kwargs={}, response_metadata={})]
Writing a custom token counter
We can write a custom token counter function that takes in a list of messages and returns an int.
pip install -qU tiktoken
from typing import List
import tiktoken
from langchain_core.messages import BaseMessage, ToolMessage
def str_token_counter(text: str) -> int:
    enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(text))


def tiktoken_counter(messages: List[BaseMessage]) -> int:
    """Approximately reproduce https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

    For simplicity only supports str Message.contents.
    """
    num_tokens = 3  # every reply is primed with <|start|>assistant<|message|>
    tokens_per_message = 3
    tokens_per_name = 1
    for msg in messages:
        if isinstance(msg, HumanMessage):
            role = "user"
        elif isinstance(msg, AIMessage):
            role = "assistant"
        elif isinstance(msg, ToolMessage):
            role = "tool"
        elif isinstance(msg, SystemMessage):
            role = "system"
        else:
            raise ValueError(f"Unsupported messages type {msg.__class__}")
        num_tokens += (
            tokens_per_message
            + str_token_counter(role)
            + str_token_counter(msg.content)
        )
        if msg.name:
            num_tokens += tokens_per_name + str_token_counter(msg.name)
    return num_tokens
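# As a quick sanity check (an optional addition, not part of the original guide),
# you can call the custom counter directly on the full message list before trimming:
tiktoken_counter(messages)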
trim_messages(
messages,
token_counter=tiktoken_counter,
# Keep the last <= max_tokens tokens of the messages.
strategy="last",
# Remember to adjust for your use case
max_tokens=45,
# Most chat models expect that chat history starts with either:
# (1) a HumanMessage or
# (2) a SystemMessage followed by a HumanMessage
start_on="human",
# Most chat models expect that chat history ends with either:
# (1) a HumanMessage or
# (2) a ToolMessage
end_on=("human","tool"),
# Usually, we want to keep the SystemMessage
# if it's present in the original history.
# The SystemMessage has special instructions for the model.
include_system=True,
)
[SystemMessage(content="you're a good assistant, you always respond with a joke.", additional_kwargs={}, response_metadata={}),
HumanMessage(content='what do you call a speechless parrot', additional_kwargs={}, response_metadata={})]
Chaining
trim_messages can be used imperatively (like above) or declaratively, making it easy to compose with other components in a chain:
llm = ChatOpenAI(model="gpt-4o")
# Notice we don't pass in messages. This creates
# a RunnableLambda that takes messages as input
trimmer = trim_messages(
token_counter=llm,
# Keep the last <= max_tokens tokens of the messages.
strategy="last",
# Remember to adjust for your use case
max_tokens=45,
# Most chat models expect that chat history starts with either:
# (1) a HumanMessage or
# (2) a SystemMessage followed by a HumanMessage
start_on="human",
# Most chat models expect that chat history ends with either:
# (1) a HumanMessage or
# (2) a ToolMessage
end_on=("human","tool"),
# Usually, we want to keep the SystemMessage
# if it's present in the original history.
# The SystemMessage has special instructions for the model.
include_system=True,
)
chain = trimmer | llm
chain.invoke(messages)
AIMessage(content='A "polly-no-wanna-cracker"!', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 11, 'prompt_tokens': 32, 'total_tokens': 43, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_90d33c15d4', 'finish_reason': 'stop', 'logprobs': None}, id='run-b1f8b63b-6bc2-4df4-b3b9-dfc4e3e675fe-0', usage_metadata={'input_tokens': 32, 'output_tokens': 11, 'total_tokens': 43, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})
Looking at the LangSmith trace, we can see that before the messages are passed to the model, they are first trimmed: https://smith.langchain.com/public/65af12c4-c24d-4824-90f0-6547566e59bb/r
Looking at just the trimmer, we can see that it's a Runnable object that can be invoked like all Runnables:
trimmer.invoke(messages)
[SystemMessage(content="you're a good assistant, you always respond with a joke.", additional_kwargs={}, response_metadata={}),
HumanMessage(content='what do you call a speechless parrot', additional_kwargs={}, response_metadata={})]
Using with ChatMessageHistory
Trimming messages is especially useful when working with chat histories, which can get arbitrarily long:
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
chat_history = InMemoryChatMessageHistory(messages=messages[:-1])
def dummy_get_session_history(session_id):
    if session_id != "1":
        return InMemoryChatMessageHistory()
    return chat_history
trimmer = trim_messages(
max_tokens=45,
strategy="last",
token_counter=llm,
# Usually, we want to keep the SystemMessage
# if it's present in the original history.
# The SystemMessage has special instructions for the model.
include_system=True,
# Most chat models expect that chat history starts with either:
# (1) a HumanMessage or
# (2) a SystemMessage followed by a HumanMessage
# start_on="human" makes sure we produce a valid chat history
start_on="human",
)
chain = trimmer | llm
chain_with_history = RunnableWithMessageHistory(chain, dummy_get_session_history)
chain_with_history.invoke(
[HumanMessage("what do you call a speechless parrot")],
config={"configurable":{"session_id":"1"}},
)
AIMessage(content='A "polygon"!', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 4, 'prompt_tokens': 32, 'total_tokens': 36, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_c17d3befe7', 'finish_reason': 'stop', 'logprobs': None}, id='run-71d9fce6-bb0c-4bb3-acc8-d5eaee6ae7bc-0', usage_metadata={'input_tokens': 32, 'output_tokens': 4, 'total_tokens': 36})
Looking at the LangSmith trace, we can see that we retrieve all of our messages, but before the messages are passed to the model they are trimmed to just the system message and the last human message: https://smith.langchain.com/public/17dd700b-9994-44ca-930c-116e00997315/r
API reference
For a complete description of all arguments, head to the API reference: https://python.langchain.com/api_reference/core/messages/langchain_core.messages.utils.trim_messages.html