API Reference: Batch Inference

API reference for the Batch Inference endpoints.

The /batch_inference endpoints allow users to take advantage of batched inference offered by LLM providers. These inferences are often substantially cheaper than the synchronous APIs. The handling and eventual data model for inferences made through this endpoint are equivalent to those made through the main /inference endpoint, with a few exceptions:
  • A single variant is sampled from the function being called and used for every inference in the batch.
  • There are no fallbacks or retries for batched inferences.
  • Only variants of type chat_completion are supported.
  • Caching is not supported.
  • The dryrun setting is not supported.
  • Streaming is not supported.
Under the hood, the gateway validates all of the requests, samples a single variant from the function being called, handles templating when applicable, and routes the inference to the appropriate model provider. In the batch endpoint there are no fallbacks, as the requests are processed asynchronously.
The typical workflow is to first use the POST /batch_inference endpoint to submit a batch of requests. Later, you can poll the GET /batch_inference/:batch_id or GET /batch_inference/:batch_id/inference/:inference_id endpoints to check the status of the batch and retrieve results. Each poll returns either a pending or failed status or the results of the batch. Even after a batch has completed and been processed, you can continue to poll the endpoint as a way of retrieving the results. The first time a batch completes and is processed, the results are stored in the ChatInference, JsonInference, and ModelInference tables, as with the /inference endpoint. On subsequent polls after the batch has finished, the gateway rehydrates the stored results into the expected response format.
See the Batch Inference Guide for a simple example of using the batch inference endpoints.

POST /batch_inference

Request

additional_tools

  • Type: list of lists of tools (see below)
  • Required: no (default: no additional tools)
A list of lists of tools defined at inference time that the model is allowed to call. This field allows for dynamic tool use, i.e. defining tools at runtime. Each element in the outer list corresponds to a single inference in the batch. Each inner list contains the tools that should be available to the corresponding inference. You should prefer to define tools in the configuration file if possible. Only use this field if dynamic tool use is necessary for your use case. Each tool is an object with the following fields: description, name, parameters, and strict. The fields are identical to those in the configuration file, except that the parameters field should contain the JSON schema itself rather than a path to it. See Configuration Reference for more details.
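For example, a sketch of this field for a batch of two inferences, using a hypothetical get_temperature tool defined at runtime:

{
  // ...
  "additional_tools": [
    // tools for the first inference in the batch
    [
      {
        "name": "get_temperature",
        "description": "Get the current temperature for a given location",
        "parameters": {
          "type": "object",
          "properties": { "location": { "type": "string" } },
          "required": ["location"]
        },
        "strict": false
      }
    ],
    // tools for the second inference in the batch
    [
      // ...
    ]
  ]
  // ...
}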

allowed_tools

  • Type: list of lists of strings
  • Required: no
A list of lists of tool names that the model is allowed to call. The tools must be defined in the configuration file or provided dynamically via additional_tools. Each element in the outer list corresponds to a single inference in the batch. Each inner list contains the names of the tools that are allowed for the corresponding inference. Some providers (notably OpenAI) natively support restricting allowed tools. For these providers, we send all tools (both configured and dynamic) to the provider, and separately specify which ones are allowed to be called. For providers that do not natively support this feature, we filter the tool list ourselves and only send the allowed tools to the provider.
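For example, a sketch for a batch of two inferences that restricts the first inference to a hypothetical get_temperature tool and the second to a hypothetical get_humidity tool:

{
  // ...
  "allowed_tools": [
    ["get_temperature"],
    ["get_humidity"]
  ]
  // ...
}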

credentials

  • Type: object (a map from dynamic credential names to API keys)
  • Required: no (default: no credentials)
Each model provider in your TensorZero configuration can be configured to accept credentials at inference time by using the dynamic location (e.g. dynamic::my_dynamic_api_key_name). See the configuration reference for more details. The gateway expects the credentials to be provided in the credentials field of the request body as specified below. The gateway will return a 400 error if the credentials are not provided and the model provider has been configured with dynamic credentials.
[models.my_model_name.providers.my_provider_name]
# ...
# Note: the name of the credential field (e.g. `api_key_location`) depends on the provider type
api_key_location = "dynamic::my_dynamic_api_key_name"
# ...
{  // ...  "credentials": {    // ...    "my_dynamic_api_key_name":"sk-..."    // ...  }  // ...}

episode_ids

  • Type: list of UUIDs
  • Required: no
The IDs of existing episodes to associate the inferences with. Each element in the list corresponds to a single inference in the batch. You can provide null for episode IDs for elements that should start a fresh episode. Only use episode IDs that were returned by the TensorZero gateway.
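For example, a sketch for a batch of three inferences where the first continues an existing episode (using a placeholder episode ID previously returned by the gateway) and the other two start fresh episodes:

{
  // ...
  "episode_ids": [
    "019470f0-d34a-77a3-9e59-bc933973d087",
    null,
    null
  ]
  // ...
}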

function_name

  • Type: string
  • Required: yes
The name of the function to call. This function will be the same for all inferences in the batch. The function must be defined in the configuration file.

inputs

  • Type: list of input objects (see below)
  • Required: yes
The input to the function. Each element in the list corresponds to a single inference in the batch.
input[].messages
  • Type: list of messages (see below)
  • Required: no (default: [])
A list of messages to provide to the model. Each message is an object with the following fields:
  • role: The role of the message (assistant or user).
  • content: The content of the message (see below).
The content field can have one of the following types:
  • string: the text for a text message (only allowed if there is no schema for that role)
  • list of content blocks: the content blocks for the message (see below)
A content block is an object with the field type and additional fields depending on the type. If the content block has type text, it must have one of the following additional fields:
  • text: The text for the content block.
  • arguments: A JSON object containing the function arguments for TensorZero functions with templates and schemas (see Create a prompt template for details).
If the content block has type tool_call, it must have the following additional fields:
  • arguments: The arguments for the tool call.
  • id: The ID for the content block.
  • name: The name of the tool for the content block.
If the content block has type tool_result, it must have the following additional fields:
  • id: The ID for the content block.
  • name: The name of the tool for the content block.
  • result: The result of the tool call.
If the content block has type file, it must have exactly one of the following sets of additional fields (a short example follows the list):
  • File URLs
    • file_type: must be url
    • url
    • mime_type (optional): override the MIME type of the file
    • filename (optional): a filename to associate with the file
  • Base64-encoded Files
    • file_type: must be base64
    • data: base64-encoded data for an embedded file
    • mime_type: the MIME type (e.g. image/png, image/jpeg, application/pdf)
    • filename (optional): a filename to associate with the file
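For illustration, here is a sketch of the two forms of a file content block; the URL and base64 data shown are placeholders:

// File URL
{
  "type": "file",
  "file_type": "url",
  "url": "https://example.com/photo.jpg"
}

// Base64-encoded file
{
  "type": "file",
  "file_type": "base64",
  "mime_type": "image/png",
  "data": "iVBORw0KGgo..."
}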
See the Multimodal Inference guide for more details on how to use images in inference.
If the content block has type raw_text, it must have the following additional fields:
  • value: The text for the content block. This content block will ignore any relevant templates and schemas for this function.
If the content block has type thought, it must have the following additional fields:
  • text: The text for the content block.
If the content block has type unknown, it must have the following additional fields:
  • data: The original content block from the provider, without any validation or transformation by TensorZero.
  • model_provider_name (optional): A string specifying when this content block should be included in the model provider input. If set, the content block will only be provided to this specific model provider. If not set, the content block is passed to all model providers.
For example, the following hypothetical unknown content block will send the daydreaming content block to inference requests targeting the your_model_provider_name model provider.
{  "type":"unknown",  "data": {    "type":"daydreaming",    "dream":"..."  },  "model_provider_name":"tensorzero::model_name::your_model_name::provider_name::your_model_provider_name"}
This is the most complex field in the entire API. See the example below for more details.
{  // ...  "input": {    "messages": [      // If you don't have a user (or assistant) schema...      {        "role":"user",// (or "assistant")        "content":"What is the weather in Tokyo?"      },      // If you have a user (or assistant) schema...      {        "role":"user",// (or "assistant")        "content": [          {            "type":"text",            "arguments": {              "location":"Tokyo"              // ...            }          }        ]      },      // If the model previously called a tool...      {        "role":"assistant",        "content": [          {            "type":"tool_call",            "id":"0",            "name":"get_temperature",            "arguments":"{\"location\":\"Tokyo\"}"          }        ]      },      // ...and you're providing the result of that tool call...      {        "role":"user",        "content": [          {            "type":"tool_result",            "id":"0",            "name":"get_temperature",            "result":"70"          }        ]      },      // You can also specify a text message using a content block...      {        "role":"user",        "content": [          {            "type":"text",            "text":"What about NYC?" // (or object if there is a schema)          }        ]      },      // You can also provide multiple content blocks in a single message...      {        "role":"assistant",        "content": [          {            "type":"text",            "text":"Sure, I can help you with that." // (or object if there is a schema)          },          {            "type":"tool_call",            "id":"0",            "name":"get_temperature",            "arguments":"{\"location\":\"New York\"}"          }        ]      }      // ...    ]    // ...  }  // ...}
input[].system
  • Type: string or object
  • Required: no
The input for the system message. If the function does not have a system schema, this field should be a string. If the function has a system schema, this field should be an object that matches the schema.
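For example, a sketch of both cases for a single element of inputs (the assistant_name field is hypothetical and would be defined by your system schema):

// If the function has no system schema...
{
  "system": "You are a helpful assistant.",
  "messages": [
    // ...
  ]
}

// If the function has a system schema...
{
  "system": {
    "assistant_name": "Alfred"
  },
  "messages": [
    // ...
  ]
}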

output_schemas

  • Type: list of optional objects (valid JSON Schema)
  • Required: no
A list of JSON schemas that will be used to validate the output of the function for each inference in the batch. Each element in the list corresponds to a single inference in the batch. These can be null for elements that should use the output_schema defined in the function configuration. This schema is used for validating the output of the function, and is sent to providers that support structured outputs.
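For example, a sketch for a batch of two inferences where the first overrides the output schema with a hypothetical schema and the second uses the output_schema from the function configuration:

{
  // ...
  "output_schemas": [
    {
      "type": "object",
      "properties": {
        "answer": { "type": "string" }
      },
      "required": ["answer"]
    },
    null
  ]
  // ...
}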

parallel_tool_calls

  • Type: list of optional booleans
  • Required: no
A list of booleans that indicate whether each inference in the batch should be allowed to request multiple tool calls in a single conversation turn. Each element in the list corresponds to a single inference in the batch. You can provide null for elements that should use the configuration value for the function being called. If you don't provide this field at all, we default to the configuration value for the function being called. Most model providers do not support parallel tool calls. In those cases, the gateway ignores this field. At the moment, only Fireworks AI and OpenAI support parallel tool calls.
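For example, a sketch for a batch of three inferences that enables parallel tool calls for the first inference, disables them for the second, and uses the function's configured value for the third:

{
  // ...
  "parallel_tool_calls": [true, false, null]
  // ...
}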

params

  • Type: object (see below)
  • Required: no (default: {})
Override inference-time parameters for a particular variant type. This field allows for dynamic inference parameters, i.e. defining parameters at runtime. This field's format is { variant_type: { param: [value1, ...], ... }, ... }. You should prefer to set these parameters in the configuration file if possible. Only use this field if you need to set these parameters dynamically at runtime. If specified, each parameter should be a list of values (which may be null) with the same length as the batch size. Note that the parameters will apply to every variant of the specified type. Currently, we support the following:
  • chat_completion
    • frequency_penalty
    • json_mode
    • max_tokens
    • presence_penalty
    • reasoning_effort
    • seed
    • service_tier
    • stop_sequences
    • temperature
    • thinking_budget_tokens
    • top_p
    • verbosity
See Configuration Reference for more details on the parameters, and Examples below for usage.
For example, if you wanted to dynamically override the temperature parameter for a chat_completion variant for the first inference in a batch of 3, you'd include the following in the request body:
{  // ...  "params": {    "chat_completion": {      "temperature": [0.7,null,null]    }  }  // ...}

tags

  • Type: list of optional JSON objects with string keys and values
  • Required: no
User-provided tags to associate with the inference. Each element in the list corresponds to a single inference in the batch. For example, [{"user_id": "123"}, null] or [{"author": "Alice"}, {"author": "Bob"}].

tool_choice

  • Type: list of optional strings
  • Required: no
If set, overrides the tool choice strategy for the request. Each element in the list corresponds to a single inference in the batch. The supported tool choice strategies are listed below, followed by a short example:
  • none: The function should not use any tools.
  • auto: The model decides whether or not to use a tool. If it decides to use a tool, it also decides which tools to use.
  • required: The model should use a tool. If multiple tools are available, the model decides which tool to use.
  • { specific = "tool_name" }: The model should use a specific tool. The tool must be defined in the tools section of the configuration file or provided in additional_tools.
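For example, a sketch for a batch of three inferences where the first disables tool use, the second requires a tool call, and the third keeps the configured strategy:

{
  // ...
  "tool_choice": ["none", "required", null]
  // ...
}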

variant_name

  • Type: string
  • Required: no
If set, pins the batch inference request to a particular variant (not recommended). You should generally not set this field, and instead let the TensorZero gateway assign a variant. This field is primarily used for testing or debugging purposes.

Response

For a POST request to /batch_inference, the response is a JSON object containing metadata that allows you to refer to the batch and poll it later on. The response is an object with the following fields:

batch_id

  • Type: UUID
The ID of the batch.

inference_ids

  • Type: list of UUIDs
The IDs of the inferences in the batch.

episode_ids

  • Type: list of UUIDs
The IDs of the episodes associated with the inferences in the batch.

Example

Imagine you have a simple TensorZero function that generates haikus using GPT-4o Mini.
[functions.generate_haiku]
type = "chat"

[functions.generate_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini-2024-07-18"
You can submit a batch inference job to generate multiple haikus with a single request. Each entry in inputs is equal to the input field in a regular inference request.
curl -X POST http://localhost:3000/batch_inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_haiku",
    "variant_name": "gpt_4o_mini",
    "inputs": [
      {
        "messages": [
          {
            "role": "user",
            "content": "Write a haiku about artificial intelligence."
          }
        ]
      },
      {
        "messages": [
          {
            "role": "user",
            "content": "Write a haiku about general aviation."
          }
        ]
      },
      {
        "messages": [
          {
            "role": "user",
            "content": "Write a haiku about anime."
          }
        ]
      }
    ]
  }'
The response contains a batch_id as well as inference_ids and episode_ids for each inference in the batch.
{  "batch_id":"019470f0-db4c-7811-9e14-6fe6593a2652",  "inference_ids": [    "019470f0-d34a-77a3-9e59-bcc66db2b82f",    "019470f0-d34a-77a3-9e59-bcdd2f8e06aa",    "019470f0-d34a-77a3-9e59-bcecfb7172a0"  ],  "episode_ids": [    "019470f0-d34a-77a3-9e59-bc933973d087",    "019470f0-d34a-77a3-9e59-bca6e9b748b2",    "019470f0-d34a-77a3-9e59-bcb20177bf3a"  ]}

GET /batch_inference/:batch_id

Both this and the following GET endpoint can be used to poll the status of a batch. If you use this endpoint and poll with only the batch ID, the entire batch will be returned if possible. The response format depends on the function type as well as the batch status when polled.

Pending

{"status": "pending"}

Failed

{"status": "failed"}

Completed

status

  • Type: literal string "completed"

batch_id

  • Type: UUID

inferences

  • Type: list of objects that exactly match the response body in the inference endpoint documented here.

Example

Extending the example from above: you can use the batch_id to poll the status of this job:
curl -X GET http://localhost:3000/batch_inference/019470f0-db4c-7811-9e14-6fe6593a2652
While the job is pending, the response will only contain the status field.
{  "status":"pending"}
Once the job is completed, the response will contain the status field and the inferences field. Each inference object is the same as the response from a regular inference request.
{  "status":"completed",  "batch_id":"019470f0-db4c-7811-9e14-6fe6593a2652",  "inferences": [    {      "inference_id":"019470f0-d34a-77a3-9e59-bcc66db2b82f",      "episode_id":"019470f0-d34a-77a3-9e59-bc933973d087",      "variant_name":"gpt_4o_mini",      "content": [        {          "type":"text",          "text":"Whispers of circuits,\nLearning paths through endless code,\nDreams in binary."        }      ],      "usage": {        "input_tokens":15,        "output_tokens":19      }    },    {      "inference_id":"019470f0-d34a-77a3-9e59-bcdd2f8e06aa",      "episode_id":"019470f0-d34a-77a3-9e59-bca6e9b748b2",      "variant_name":"gpt_4o_mini",      "content": [        {          "type":"text",          "text":"Wings of freedom soar,\nClouds embrace the lonely flight,\nSky whispers adventure."        }      ],      "usage": {        "input_tokens":15,        "output_tokens":20      }    },    {      "inference_id":"019470f0-d34a-77a3-9e59-bcecfb7172a0",      "episode_id":"019470f0-d34a-77a3-9e59-bcb20177bf3a",      "variant_name":"gpt_4o_mini",      "content": [        {          "type":"text",          "text":"Vivid worlds unfold,\nHeroes rise with dreams in hand,\nInk and dreams collide."        }      ],      "usage": {        "input_tokens":14,        "output_tokens":20      }    }  ]}

GET /batch_inference/:batch_id/inference/:inference_id

This endpoint can be used to poll the status of a single inference in a batch. Since the polling involves pulling data on all the inferences in the batch, we also store the status of all those inferences in ClickHouse. The response format depends on the function type as well as the batch status when polled.

Pending

{"status": "pending"}

Failed

{"status": "failed"}

Completed

status

  • Type: literal string "completed"

batch_id

  • Type: UUID

inferences

  • Type: list containing a single object that exactly matches the response body in the inference endpoint documented here.

Example

Similar to above, we can also poll a particular inference:
curl -X GET http://localhost:3000/batch_inference/019470f0-db4c-7811-9e14-6fe6593a2652/inference/019470f0-d34a-77a3-9e59-bcc66db2b82f
While the job is pending, the response will only contain the status field.
{  "status":"pending"}
Once the job is completed, the response will contain the status field and the inferences field. Unlike above, this request will return a list containing only the requested inference.
{  "status":"completed",  "batch_id":"019470f0-db4c-7811-9e14-6fe6593a2652",  "inferences": [    {      "inference_id":"019470f0-d34a-77a3-9e59-bcc66db2b82f",      "episode_id":"019470f0-d34a-77a3-9e59-bc933973d087",      "variant_name":"gpt_4o_mini",      "content": [        {          "type":"text",          "text":"Whispers of circuits,\nLearning paths through endless code,\nDreams in binary."        }      ],      "usage": {        "input_tokens":15,        "output_tokens":19      }    }  ]}