Learn how to use retries and fallbacks to handle errors and improve reliability with TensorZero.
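Everything described below happens inside the gateway, so your application makes a single ordinary inference request no matter how many retries or fallbacks occur behind the scenes. As a point of reference, here is a minimal sketch of such a request in Python, assuming a gateway running locally on the default port 3000 and the `extract_data` function used in the examples below:

```python
import requests

# Minimal sketch: call the TensorZero Gateway's HTTP inference endpoint.
# Assumes the gateway is running locally on port 3000 and that a function
# named `extract_data` is defined in the configuration (see the examples below).
response = requests.post(
    "http://localhost:3000/inference",
    json={
        "function_name": "extract_data",
        "input": {
            "messages": [
                {"role": "user", "content": "Extract the data from this invoice: ..."}
            ]
        },
    },
)
response.raise_for_status()
print(response.json())
```

Any retries and fallbacks you configure are applied transparently to this request before a response (or final error) is returned.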
## Model Provider Fallbacks

You can configure fallbacks between model providers by listing multiple providers in a model's `routing` field. If the list includes multiple providers, the gateway will try each one sequentially until one succeeds or all fail. In the example below, the gateway will first try OpenAI, and if that fails, it will try Azure.

```toml
[models.gpt_4o_mini]
# Try the following providers in order:
# 1. `models.gpt_4o_mini.providers.openai`
# 2. `models.gpt_4o_mini.providers.azure`
routing = ["openai", "azure"]

[models.gpt_4o_mini.providers.openai]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"

[models.gpt_4o_mini.providers.azure]
type = "azure"
deployment_id = "gpt4o-mini-20240718"
endpoint = "https://your-azure-openai-endpoint.openai.azure.com"

[functions.extract_data]
type = "chat"

[functions.extract_data.variants.gpt_4o_mini]
type = "chat_completion"
model = "gpt_4o_mini"
```

The `routing` fallback applies to each individual model inference separately.

## Variant Retries

You can add a `retries` field to a variant to specify the number of times to retry that variant if it fails. The retry strategy is truncated exponential backoff with jitter. In the example below, the gateway will retry the variant four times (i.e. a total of five attempts), with a maximum delay of 10 seconds between retries.

```toml
[functions.extract_data]
type = "chat"

[functions.extract_data.variants.claude_haiku_4_5]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"
# Retry the variant up to four times, with a maximum delay of 10 seconds between retries.
retries = { num_retries = 4, max_delay_s = 10 }
```
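The gateway handles this backoff internally, but to make the retry schedule concrete, here is an illustrative sketch of truncated exponential backoff with jitter. The base delay, doubling factor, and "full jitter" scheme below are assumptions for illustration, not TensorZero's actual internal parameters:

```python
import random
import time

def retry_with_backoff(attempt_fn, num_retries=4, max_delay_s=10.0, base_delay_s=1.0):
    """Illustrative truncated exponential backoff with full jitter.

    The base delay, doubling factor, and jitter scheme here are assumptions;
    TensorZero's internal retry parameters may differ.
    """
    for attempt in range(num_retries + 1):
        try:
            return attempt_fn()
        except Exception:
            if attempt == num_retries:
                raise  # out of retries: surface the error (so fallbacks can kick in)
            # The delay grows exponentially, is truncated at max_delay_s,
            # and is scaled by random jitter to avoid thundering herds.
            delay = min(max_delay_s, base_delay_s * 2**attempt)
            time.sleep(random.uniform(0, delay))
```

Under these assumed parameters, `num_retries = 4` and `max_delay_s = 10` would yield pre-jitter delays of 1, 2, 4, and 8 seconds between the five attempts.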
## Variant Fallbacks & Experimentation

You can control how the gateway samples between a function's variants in the `[functions.function_name.experimentation]` section. In the example below, both variants have an equal chance of being selected:

```toml
[functions.draft_email]
type = "chat"

[functions.draft_email.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"

[functions.draft_email.variants.claude_haiku_4_5]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"
```

You can also designate fallback variants. In the example below, the gateway will first sample uniformly between `gpt_5_mini` and `claude_haiku_4_5`. If both of those variants fail, the gateway will try the fallback variants in order: first `grok_4`, then `gemini_2_5_flash`.

```toml
[functions.draft_email]
type = "chat"

[functions.draft_email.experimentation]
type = "uniform"
candidate_variants = ["gpt_5_mini", "claude_haiku_4_5"]
fallback_variants = ["grok_4", "gemini_2_5_flash"]

[functions.draft_email.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"

[functions.draft_email.variants.claude_haiku_4_5]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"

[functions.draft_email.variants.grok_4]
type = "chat_completion"
model = "xai::grok-4-0709"

[functions.draft_email.variants.gemini_2_5_flash]
type = "chat_completion"
model = "google_ai_studio_gemini::gemini-2.5-flash"
```

With `static_weights`, you can assign explicit traffic weights instead. In the example below, the gateway will sample `gpt_5_mini` 70% of the time and `claude_haiku_4_5` 30% of the time. If both of those variants fail, the gateway will try the fallback variants sequentially.

```toml
[functions.draft_email.experimentation]
type = "static_weights"
candidate_variants = { "gpt_5_mini" = 0.7, "claude_haiku_4_5" = 0.3 }
fallback_variants = ["grok_4", "gemini_2_5_flash"]
```

## Putting It All Together

Overall, the gateway's behavior for a single inference request roughly follows this pseudocode:

```python
while variants:
    # Sample according to the experimentation config (uniform, static_weights, etc.)
    variant = sample_variant(variants)  # sampling without replacement

    for _ in range(num_retries + 1):
        for provider in variant.routing:
            try:
                return inference(variant, provider)
            except Exception:
                continue
```

## Example: Load Balancing Between API Keys

You can also combine these features for load balancing. In the example below, the function splits traffic between two variants (`gpt_4o_mini_api_key_A` and `gpt_4o_mini_api_key_B`). Each variant leverages a model with providers that use different API keys (`OPENAI_API_KEY_A` and `OPENAI_API_KEY_B`). See Credential Management for more details on credential management.

```toml
[models.gpt_4o_mini_api_key_A]
routing = ["openai"]

[models.gpt_4o_mini_api_key_A.providers.openai]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
api_key_location = "env:OPENAI_API_KEY_A"

[models.gpt_4o_mini_api_key_B]
routing = ["openai"]

[models.gpt_4o_mini_api_key_B.providers.openai]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
api_key_location = "env:OPENAI_API_KEY_B"

[functions.extract_data]
type = "chat"

# Uniform sampling (the default) splits traffic equally between the variants.

[functions.extract_data.variants.gpt_4o_mini_api_key_A]
type = "chat_completion"
model = "gpt_4o_mini_api_key_A"

[functions.extract_data.variants.gpt_4o_mini_api_key_B]
type = "chat_completion"
model = "gpt_4o_mini_api_key_B"
```

## Timeouts

You can configure granular timeouts with the `timeouts` field in the corresponding configuration block. You can define timeouts for non-streaming and streaming requests separately: `timeouts.non_streaming.total_ms` corresponds to the total request duration, and `timeouts.streaming.ttft_ms` corresponds to the time to first token (TTFT). For example, the following configuration sets a 15-second timeout for non-streaming requests and a 3-second timeout for streaming requests (TTFT) to a particular model provider:

```toml
[models.model_name.providers.provider_name]
# ...
timeouts = { non_streaming.total_ms = 15000, streaming.ttft_ms = 3000 }
# ...
```

You can also enforce timeouts on the client side with the client's `timeout` field (or by simply killing the request if you're using a different client). Embedding models and embedding model providers support a `timeout_ms` configuration field.

Finally, you can set a global timeout for all outbound HTTP requests with `gateway.global_outbound_http_timeout_ms` in your configuration. By default, this is set to 15 minutes to accommodate slow model responses.

```toml
[gateway]
global_outbound_http_timeout_ms = 900_000  # 15 minutes
```
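The client-side timeout mentioned above can be approximated with any HTTP client. Here is a minimal sketch using Python's `requests`, with a hypothetical 15-second deadline matching the non-streaming example:

```python
import requests

# Minimal sketch: enforce a deadline on the client side by passing a timeout
# to the HTTP client. The 15-second value is a hypothetical example.
try:
    response = requests.post(
        "http://localhost:3000/inference",
        json={
            "function_name": "extract_data",
            "input": {"messages": [{"role": "user", "content": "..."}]},
        },
        timeout=15,  # seconds; raises requests.exceptions.Timeout on expiry
    )
    response.raise_for_status()
    print(response.json())
except requests.exceptions.Timeout:
    print("Gave up after 15 seconds")
```

Unlike the gateway-side `timeouts` settings, this simply abandons the request from the caller's perspective.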