How to stream responses from an LLM
All LLMs implement the Runnable interface, which comes with default implementations of standard runnable methods (i.e., ainvoke, batch, abatch, stream, astream, astream_events).
The default streaming implementations provide an Iterator (or AsyncIterator for asynchronous streaming) that yields a single value: the final output from the underlying chat model provider.
The ability to stream the output token-by-token depends on whether the provider has implemented proper streaming support.
See which integrations support token-by-token streaming here.
The default implementation does not provide support for token-by-token streaming, but it ensures that the model can be swapped in for any other model, as it supports the same standard interface.
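As a quick illustration of that fallback behavior, here is a minimal sketch using FakeListLLM, a test model from langchain_core that has no native streaming; the prompt and canned response are just placeholders:
from langchain_core.language_models import FakeListLLM

# FakeListLLM does not implement streaming, so .stream() falls back to the
# default implementation and yields the final output as a single chunk.
fake_llm = FakeListLLM(responses=["Sparkling water, oh so clear"])
for chunk in fake_llm.stream("Write me a 1 verse song about sparkling water."):
    print(chunk, end="|", flush=True)
# -> Sparkling water, oh so clear|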
Sync stream
Below we use a | to help visualize the delimiter between tokens.
from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)
for chunk in llm.stream("Write me a 1 verse song about sparkling water."):
    print(chunk, end="|", flush=True)
|Spark|ling| water|,| oh| so clear|
|Bubbles dancing|,| without| fear|
|Refreshing| taste|,| a| pure| delight|
|Spark|ling| water|,| my| thirst|'s| delight||
Async streaming
Let's see how to stream in an async setting using astream.
from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)
async for chunk in llm.astream("Write me a 1 verse song about sparkling water."):
    print(chunk, end="|", flush=True)
|Spark|ling| water|,| oh| so clear|
|Bubbles dancing|,| without| fear|
|Refreshing| taste|,| a| pure| delight|
|Spark|ling| water|,| my| thirst|'s| delight||
Async event streaming
LLMs also support the standard astream_events method, which is most useful when implementing streaming in a larger LLM application that contains multiple steps (e.g., an application that involves an agent).
from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)

idx = 0

async for event in llm.astream_events(
    "Write me a 1 verse song about goldfish on the moon", version="v1"
):
    idx += 1
    if idx >= 5:  # Truncate the output
        print("...Truncated")
        break
    print(event)
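In a real application you would usually not print every raw event, but filter on the event name. Below is a minimal sketch of that pattern, assuming the v1 event schema in which on_llm_stream events carry the streamed token under data["chunk"]:
from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)

async for event in llm.astream_events(
    "Write me a 1 verse song about goldfish on the moon", version="v1"
):
    # Keep only the token-level streaming events and ignore start/end events.
    if event["event"] == "on_llm_stream":
        print(event["data"]["chunk"], end="|", flush=True)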