Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 692 Commits
.github		.github
api		api
client-go		client-go
cmd		cmd
config		config
conformance		conformance
docs		docs
hack		hack
internal		internal
pkg		pkg
site-src		site-src
test		test
tools		tools
.dockerignore		.dockerignore
.gitignore		.gitignore
.golangci.yml		.golangci.yml
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
OWNERS		OWNERS
OWNERS_ALIASES		OWNERS_ALIASES
PROJECT		PROJECT
README.md		README.md
RELEASE.md		RELEASE.md
SECURITY.md		SECURITY.md
SECURITY_CONTACTS		SECURITY_CONTACTS
bbr.Dockerfile		bbr.Dockerfile
cloudbuild.yaml		cloudbuild.yaml
code-of-conduct.md		code-of-conduct.md
crd-ref-docs.yaml		crd-ref-docs.yaml
go.mod		go.mod
go.sum		go.sum
mkdocs.yml		mkdocs.yml
netlify.toml		netlify.toml

Repository files navigation

Gateway API Inference Extension

Gateway API Inference Extension optimizes self-hosting Generative Models on Kubernetes.This is achieved by leveraging Envoy'sExternal Processing (ext-proc) to extend any gateway that supports both ext-proc andGateway API into aninference gateway.

New!

Inference Gateway has partnered with vLLM to accelerate LLM serving optimizations withllm-d!

Concepts and Definitions

The following specific terms to this project:

Inference Gateway (IGW): A proxy/load-balancer which has been coupled with anEndpoint Picker. It provides optimized routing and load balancing forserving Kubernetes self-hosted generative Artificial Intelligence (AI)workloads. It simplifies the deployment, management, and observability of AIinference workloads.
Inference Scheduler: An extendable component that makes decisions about which endpoint is optimal (best cost /best performance) for an inference request based onMetrics and CapabilitiesfromModel Serving.
Metrics and Capabilities: Data provided by model serving platforms aboutperformance, availability and capabilities to optimize routing. Includesthings likePrefix Cache status orLoRA Adapters availability.
Endpoint Picker(EPP): An implementation of anInference Scheduler with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPPhere.

The following are key industry terms that are important to understand forthis project:

Model: A generative AI model that has learned patterns from data and isused for inference. Models vary in size and architecture, from smallerdomain-specific models to massive multi-billion parameter neural networks thatare optimized for diverse language tasks.
Inference: The process of running a generative AI model, such as a largelanguage model, diffusion model etc, to generate text, embeddings, or otheroutputs from input data.
Model server: A service (in our case, containerized) responsible forreceiving inference requests and returning predictions from a model.
Accelerator: specialized hardware, such as Graphics Processing Units(GPUs) that can be attached to Kubernetes nodes to speed up computations,particularly for training and inference tasks.

For deeper insights and more advanced concepts, refer to ourproposals.

Technical Overview

This extension upgrades anext-proc capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become aninference gateway - supporting inference platform teams self-hosting Generative Models (with a current focus on large language models) on Kubernetes. This integration makes it easy to expose and control access to your localOpenAI-compatible chat completion endpoints to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher levelAI Gateway like LiteLLM, Solo AI Gateway, or Apigee.

The Inference Gateway:

Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request scheduling alogrithm that is kv-cache and request cost aware, avoiding evictions or queueing as load increases
ProvidesKubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades
Adds end to end observability around service objective attainment
Ensures operational guardrails between different client model names, allowing a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators

Model Server Integration

IGW’s pluggable architecture was leveraged to enable thellm-d Inference Scheduler.

Llm-d customizes vLLM & IGW to create a disaggregated serving solution. We've worked closely with this team to enable this integration. IGW will continue to work closely with llm-d to generalize the disaggregated serving plugin(s), & set a standard for disaggregated serving to be used across anyprotocol-adherent model server.

IGW has enhanced support for vLLM via llm-d, and broad support for any model servers implementing the protocol. More details can be found inmodel server integration.

Status

This project is in alpha. latest release can be foundhere.
It should not be used in production yet.

Getting Started

Follow ourGetting Started Guide to get the inference-extension up and running on your cluster!

Seeour website for detailed API documentation on leveraging our Kubernetes-native declarative APIs

Roadmap

As Inference Gateway builds towards a GA release. We will continue to expand our capabilities, namely:

Prefix-cache aware load balancing with interfaces for remote caches
Recommended LoRA adapter pipeline for automated rollout
Fairness and priority between workloads within the same criticality band
HPA support for autoscaling on aggregate metrics derived from the load balancer
Support for large multi-modal inputs and outputs
Support for other GenAI model types (diffusion and other non-completion protocols)
Heterogeneous accelerators - serve workloads on multiple types of accelerator using latency and request cost-aware load balancing
Disaggregated serving support with independently scaling pools