Optimizing inference proxy for LLMs

codelion/optillm

optillm is an OpenAI API compatible optimizing inference proxy which implements several state-of-the-art techniques that can improve the accuracy and performance of LLMs. The current focus is on implementing techniques that improve reasoning over coding, logical and mathematical queries.

It is possible to beat the frontier models across diverse tasks using these techniques by spending additional compute at inference time. A good example of how to combine such techniques is the CePO approach from Cerebras.

Installation

Using pip

pip install optillm
optillm
2024-10-22 07:45:05,612 - INFO - Loaded plugin: privacy
2024-10-22 07:45:06,293 - INFO - Loaded plugin: memory
2024-10-22 07:45:06,293 - INFO - Starting server with approach: auto

Using docker

docker pull ghcr.io/codelion/optillm:latest
docker run -p 8000:8000 ghcr.io/codelion/optillm:latest
2024-10-22 07:45:05,612 - INFO - Loaded plugin: privacy
2024-10-22 07:45:06,293 - INFO - Loaded plugin: memory
2024-10-22 07:45:06,293 - INFO - Starting server with approach: auto

To use optillm without local inference and only as a proxy, you can add the -proxy suffix.

docker pull ghcr.io/codelion/optillm:latest-proxy

Install from source

Clone the repository with git and use pip install to set up the dependencies.

git clone https://github.com/codelion/optillm.git
cd optillm
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

We support all major LLM providers and models for inference. You need to set the correct environment variable and the proxy will pick the corresponding client.

| Provider | Required Environment Variables | Additional Notes |
| --- | --- | --- |
| OptiLLM | OPTILLM_API_KEY | Uses the inbuilt local server for inference, supports logprobs and decoding techniques like cot_decoding & entropy_decoding |
| OpenAI | OPENAI_API_KEY | You can use this with any OpenAI compatible endpoint (e.g. OpenRouter) by setting the base_url |
| Cerebras | CEREBRAS_API_KEY | You can use this for fast inference with supported models, see docs for details |
| Azure OpenAI | AZURE_OPENAI_API_KEY, AZURE_API_VERSION, AZURE_API_BASE | - |
| Azure OpenAI (Managed Identity) | AZURE_API_VERSION, AZURE_API_BASE | Login required using az login, see docs for details |
| LiteLLM | depends on the model | See docs for details |

You can then run the optillm proxy as follows.

python optillm.py
2024-09-06 07:57:14,191 - INFO - Starting server with approach: auto
2024-09-06 07:57:14,191 - INFO - Server configuration: {'approach': 'auto', 'mcts_simulations': 2, 'mcts_exploration': 0.2, 'mcts_depth': 1, 'best_of_n': 3, 'model': 'gpt-4o-mini', 'rstar_max_depth': 3, 'rstar_num_rollouts': 5, 'rstar_c': 1.4, 'base_url': ''}
 * Serving Flask app 'optillm'
 * Debug mode: off
2024-09-06 07:57:14,212 - INFO - WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8000
 * Running on http://192.168.10.48:8000
2024-09-06 07:57:14,212 - INFO - Press CTRL+C to quit

Usage

Once the proxy is running, you can use it as a drop-in replacement for an OpenAI client by setting the base_url to http://localhost:8000/v1.

import os
from openai import OpenAI

OPENAI_KEY = os.environ.get("OPENAI_API_KEY")
OPENAI_BASE_URL = "http://localhost:8000/v1"
client = OpenAI(api_key=OPENAI_KEY, base_url=OPENAI_BASE_URL)

response = client.chat.completions.create(
    model="moa-gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Write a Python program to build an RL model to recite text from any position that the user provides, using only numpy."
        }
    ],
    temperature=0.2
)

print(response)

The code above applies to both OpenAI and Azure OpenAI; just remember to populate the OPENAI_API_KEY environment variable with the proper key.

There are multiple ways to control the optimization techniques. They are applied in the following order of preference:

  • You can control the technique used for optimization by prepending its slug to the model name, as in {slug}-model-name. E.g. in the above code we are using moa, or mixture of agents, as the optimization approach. In the proxy logs you will see the following, showing that moa is being used with gpt-4o-mini as the base model.

    2024-09-06 08:35:32,597 - INFO - Using approach moa, with gpt-4o-mini
    2024-09-06 08:35:35,358 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
    2024-09-06 08:35:39,553 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
    2024-09-06 08:35:44,795 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
    2024-09-06 08:35:44,797 - INFO - 127.0.0.1 - - [06/Sep/2024 08:35:44] "POST /v1/chat/completions HTTP/1.1" 200 -

  • Or, you can pass the slug in the optillm_approach field in the extra_body.

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ""}],
        temperature=0.2,
        extra_body={"optillm_approach": "bon|moa|mcts"}
    )

  • Or, you can just mention the approach in either your system or user prompt, within <optillm_approach> </optillm_approach> tags.

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "<optillm_approach>re2</optillm_approach> How many r's are there in strawberry?"}],
        temperature=0.2
    )
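The order of preference described above can be sketched as a small resolver. This is only an illustration of the documented precedence, not optillm's actual code: the function name resolve_approach, the KNOWN_SLUGS subset, and the returned tuple are assumptions made for this example, and combined slugs (using & or |) in the model name are not handled.

```python
import re

# Hypothetical subset of slugs, taken from the "Implemented techniques" table.
KNOWN_SLUGS = {"moa", "bon", "mcts", "re2", "cot_reflection", "plansearch"}

TAG_PATTERN = re.compile(r"<optillm_approach>(.*?)</optillm_approach>", re.DOTALL)


def resolve_approach(model, messages, extra_body=None):
    """Return (approach, base_model, messages) using the documented precedence."""
    # 1. A slug prepended to the model name, e.g. "moa-gpt-4o-mini".
    slug, separator, rest = model.partition("-")
    if separator and slug in KNOWN_SLUGS:
        return slug, rest, messages

    # 2. The "optillm_approach" field passed through extra_body.
    if extra_body and "optillm_approach" in extra_body:
        return extra_body["optillm_approach"], model, messages

    # 3. <optillm_approach> tags inside a system or user prompt.
    for index, message in enumerate(messages):
        if message["role"] not in ("system", "user"):
            continue
        match = TAG_PATTERN.search(message["content"])
        if match:
            # Strip the tag so the model only sees the actual question.
            cleaned_text = TAG_PATTERN.sub("", message["content"]).strip()
            cleaned = dict(message, content=cleaned_text)
            new_messages = messages[:index] + [cleaned] + messages[index + 1:]
            return match.group(1).strip(), model, new_messages

    # No approach requested: pass the request through unchanged.
    return None, model, messages


print(resolve_approach("moa-gpt-4o-mini", [])[:2])  # ('moa', 'gpt-4o-mini')
```

A plain model name such as gpt-4o-mini falls through the first check because "gpt" is not a known slug, so the later two mechanisms only apply when no slug prefix is present.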

Tip

You can also combine different techniques by using the symbols & and |. When you use &, the techniques are processed in order from left to right in a pipeline, with the response from the previous stage used as the request to the next. With |, we run all the requests in parallel and generate multiple responses that are returned as a list.
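The difference between the two operators can be sketched with stand-in functions. This is a minimal illustration under stated assumptions: run_combined and the demo techniques are hypothetical names, the techniques are plain string functions rather than LLM calls, and the "parallel" branch simply evaluates them one after another for clarity.

```python
def run_combined(spec, prompt, techniques):
    """Apply a spec such as "a&b" (pipeline) or "a|b" (parallel) to a prompt.

    `techniques` maps each slug to a callable that takes and returns a string.
    """
    if "&" in spec and "|" in spec:
        raise ValueError("this sketch does not mix '&' and '|' in one spec")

    if "|" in spec:
        # Parallel: every technique sees the original prompt; results form a list.
        return [techniques[slug](prompt) for slug in spec.split("|")]

    # Pipeline: the response of each stage becomes the request of the next.
    result = prompt
    for slug in spec.split("&"):
        result = techniques[slug](result)
    return result


# Stand-in techniques so the example runs without calling any model.
demo = {
    "upper": str.upper,
    "exclaim": lambda text: text + "!",
}

print(run_combined("upper&exclaim", "hello", demo))  # HELLO!
print(run_combined("upper|exclaim", "hello", demo))  # ['HELLO', 'hello!']
```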

Please note that the convention described above works only when the optillm server has been started with the inference approach set to auto. Otherwise, the model attribute in the client request must be set to the model name only.

We now support all LLM providers (by wrapping around the LiteLLM SDK). E.g. you can use the Gemini Flash model with moa by passing the API key in the environment variable os.environ['GEMINI_API_KEY'] and then calling the model moa-gemini/gemini-1.5-flash-002. In the output you will then see that LiteLLM is being used to call the base model.

9:43:21 - LiteLLM:INFO: utils.py:2952 - LiteLLM completion() model= gemini-1.5-flash-002; provider = gemini
2024-09-29 19:43:21,011 - INFO - LiteLLM completion() model= gemini-1.5-flash-002; provider = gemini
2024-09-29 19:43:21,481 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-002:generateContent?key=[redacted] "HTTP/1.1 200 OK"
19:43:21 - LiteLLM:INFO: utils.py:988 - Wrapper: Completed Call, calling success_handler
2024-09-29 19:43:21,483 - INFO - Wrapper: Completed Call, calling success_handler
19:43:21 - LiteLLM:INFO: utils.py:2952 - LiteLLM completion() model= gemini-1.5-flash-002; provider = gemini

Tip

optillm is a transparent proxy and will work with any LLM API or provider that has an OpenAI API compatible chat completions endpoint. In turn, optillm also exposes the same OpenAI API compatible chat completions endpoint, which should allow you to integrate it into any existing tools or frameworks easily. If the LLM you want to use doesn't have an OpenAI API compatible endpoint (like Google or Anthropic), you can use the LiteLLM proxy server, which supports most LLMs.

The following sequence diagram illustrates how the request and responses go through optillm.

Sequence diagram showing optillm in use

In the diagram:

  • A is an existing tool (like oobabooga), framework (like patchwork) or your own code where you want to use the results from optillm. You can use it directly using any OpenAI client SDK.
  • B is the optillm service (running directly or in a docker container) that will send requests to the base_url.
  • C is any service providing an OpenAI API compatible chat completions endpoint.

Local inference server

We support loading any HuggingFace model or LoRA directly in optillm. To use the built-in inference server, set OPTILLM_API_KEY to any value (e.g. export OPTILLM_API_KEY="optillm") and then use the same value in your OpenAI client. You can pass any HuggingFace model in the model field. If it is a private model, make sure you set the HF_TOKEN environment variable with your HuggingFace key. We also support adding any number of LoRAs on top of the model by using the + separator.

E.g. the following code loads the base model meta-llama/Llama-3.2-1B-Instruct and then adds two LoRAs on top: patched-codes/Llama-3.2-1B-FixVulns and patched-codes/Llama-3.2-1B-FastApply. You can specify which LoRA to use with the active_adapter param in the extra_body field of the OpenAI SDK client. By default we will load the last specified adapter.

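from openai import OpenAI

OPENAI_BASE_URL = "http://localhost:8000/v1"
OPENAI_KEY = "optillm"
client = OpenAI(api_key=OPENAI_KEY, base_url=OPENAI_BASE_URL)

# Example prompt; replace with your own messages.
messages = [{"role": "user", "content": "How many r's are there in strawberry?"}]

response = client.chat.completions.create(
    # Base model followed by two LoRA adapters, separated by "+".
    model="meta-llama/Llama-3.2-1B-Instruct+patched-codes/Llama-3.2-1B-FastApply+patched-codes/Llama-3.2-1B-FixVulns",
    messages=messages,
    temperature=0.2,
    logprobs=True,
    top_logprobs=3,
    # Select which of the loaded adapters handles this request.
    extra_body={"active_adapter": "patched-codes/Llama-3.2-1B-FastApply"},
)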
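To make the + convention concrete, the sketch below splits such a model string into its base model and adapters and picks the active adapter the way the text describes (the last one unless active_adapter is given). The function name parse_model_spec and the returned dictionary are assumptions for illustration; the local inference server's own parsing may differ.

```python
def parse_model_spec(model, active_adapter=None):
    """Split "base+lora1+lora2" and decide which adapter is active."""
    base_model, *adapters = model.split("+")

    if active_adapter is None:
        # Documented default: the last specified adapter is used.
        active = adapters[-1] if adapters else None
    elif active_adapter in adapters:
        active = active_adapter
    else:
        raise ValueError(f"{active_adapter!r} is not one of the loaded adapters")

    return {"base_model": base_model, "adapters": adapters, "active_adapter": active}


spec = parse_model_spec(
    "meta-llama/Llama-3.2-1B-Instruct"
    "+patched-codes/Llama-3.2-1B-FastApply"
    "+patched-codes/Llama-3.2-1B-FixVulns",
    active_adapter="patched-codes/Llama-3.2-1B-FastApply",
)
print(spec["active_adapter"])  # patched-codes/Llama-3.2-1B-FastApply
```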

You can also use the alternate decoding techniques like cot_decoding and entropy_decoding directly with the local inference server.

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=messages,
    temperature=0.2,
    extra_body={
        "decoding": "cot_decoding",  # or "entropy_decoding"
        # CoT specific params
        "k": 10,
        "aggregate_paths": True,
        # OR Entropy specific params
        "top_k": 27,
        "min_p": 0.03,
    }
)
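Since the CoT and entropy parameters are alternatives, a small helper can keep the two sets apart when building extra_body. This is a sketch only: decoding_extra_body is a hypothetical helper, and the parameter sets below contain just the names shown in the example above; the server may accept others.

```python
# Parameter names taken from the example above; not an exhaustive list.
DECODING_PARAMS = {
    "cot_decoding": {"k", "aggregate_paths"},
    "entropy_decoding": {"top_k", "min_p"},
}


def decoding_extra_body(technique, **params):
    """Build an extra_body dict containing only params for the chosen technique."""
    allowed = DECODING_PARAMS.get(technique)
    if allowed is None:
        raise ValueError(f"unknown decoding technique: {technique!r}")

    unexpected = set(params) - allowed
    if unexpected:
        raise ValueError(f"{sorted(unexpected)} do not belong to {technique}")

    return {"decoding": technique, **params}


print(decoding_extra_body("cot_decoding", k=10, aggregate_paths=True))
# {'decoding': 'cot_decoding', 'k': 10, 'aggregate_paths': True}
```

The resulting dictionary can be passed as extra_body to client.chat.completions.create.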

Starting the optillm proxy with an external server (e.g. llama.cpp or ollama)

  • Set the OPENAI_API_KEY env variable to a placeholder value
    • e.g. export OPENAI_API_KEY="sk-no-key"
  • Run ./llama-server -c 4096 -m path_to_model to start the server with the specified model and a context length of 4096 tokens
  • Run python3 optillm.py --base_url base_url to start the proxy
    • e.g. for llama.cpp, run python3 optillm.py --base_url http://localhost:8080/v1

Warning

The Anthropic API, llama.cpp-server, and ollama currently do not support sampling multiple responses from a model, which limits the available approaches to the following: cot_reflection, leap, plansearch, rstar, rto, self_consistency, re2, and z3. For models on HuggingFace, you can use the built-in local inference server, as it supports multiple responses.
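A client-side guard for this limitation could look like the sketch below. The set of approaches is copied from the warning above; the function check_approach and its supports_multiple_responses flag are hypothetical and not part of optillm.

```python
# Approaches the warning above lists as usable without multi-response sampling.
SINGLE_RESPONSE_APPROACHES = {
    "cot_reflection", "leap", "plansearch", "rstar",
    "rto", "self_consistency", "re2", "z3",
}


def check_approach(approach, supports_multiple_responses):
    """Return the approach if the backend can run it, otherwise raise ValueError."""
    if supports_multiple_responses or approach in SINGLE_RESPONSE_APPROACHES:
        return approach
    raise ValueError(
        f"{approach!r} needs multiple sampled responses; "
        f"choose one of {sorted(SINGLE_RESPONSE_APPROACHES)}"
    )


print(check_approach("re2", supports_multiple_responses=False))  # re2
```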

MCP Plugin

The Model Context Protocol (MCP) plugin enables OptiLLM to connect with MCP servers, bringing external tools, resources, and prompts into the context of language models. This allows for powerful integrations with filesystem access, database queries, API connections, and more.

What is MCP?

The Model Context Protocol (MCP) is an open protocol standard that allows LLMs to securely access tools and data sources through a standardized interface. MCP servers can provide:

  • Tools: Callable functions that perform actions (like writing files, querying databases, etc.)
  • Resources: Data sources for providing context (like file contents)
  • Prompts: Reusable prompt templates for specific use cases

Configuration

Setting up MCP Config
  1. Create a configuration file at ~/.optillm/mcp_config.json with the following structure:

    {
      "mcpServers": {
        "filesystem": {
          "command": "npx",
          "args": [
            "-y",
            "@modelcontextprotocol/server-filesystem",
            "/path/to/allowed/directory1",
            "/path/to/allowed/directory2"
          ],
          "env": {}
        }
      },
      "log_level": "INFO"
    }

Each server entry in mcpServers consists of:

  • Server name: A unique identifier for the server (e.g., "filesystem")
  • command: The executable to run the server
  • args: Command-line arguments for the server
  • env: Environment variables for the server process
  • description (optional): Description of the server's functionality
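The configuration file can also be generated from Python, which makes it easy to check the fields listed above before writing it. This is a minimal sketch: build_mcp_config and write_mcp_config are hypothetical helpers, and treating command and args as required while defaulting env to an empty object is an assumption based on the field list, not a documented rule.

```python
import json
from pathlib import Path


def build_mcp_config(servers, log_level="INFO"):
    """Validate server entries and return the config dictionary."""
    normalized = {}
    for name, entry in servers.items():
        missing = {"command", "args"} - entry.keys()
        if missing:
            raise ValueError(f"server {name!r} is missing {sorted(missing)}")
        # Default "env" to an empty object without mutating the caller's dict.
        normalized[name] = {"env": {}, **entry}
    return {"mcpServers": normalized, "log_level": log_level}


def write_mcp_config(config, path="~/.optillm/mcp_config.json"):
    """Write the config as JSON, creating the parent directory if needed."""
    target = Path(path).expanduser()
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(config, indent=2))
    return target


config = build_mcp_config({
    "filesystem": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/directory1"],
    },
})
print(json.dumps(config, indent=2))
```

Calling write_mcp_config(config) would then place the file at the location the plugin reads from.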

Available MCP Servers

You can use any of the official MCP servers or third-party servers. Some popular options include:

  • Filesystem: @modelcontextprotocol/server-filesystem - File operations
  • Git: mcp-server-git - Git repository operations
  • SQLite: @modelcontextprotocol/server-sqlite - SQLite database access
  • Brave Search: @modelcontextprotocol/server-brave-search - Web search capabilities

Example configuration for multiple servers:

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/documents"],
      "env": {}
    },
    "search": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-brave-search"],
      "env": {
        "BRAVE_API_KEY": "your-api-key-here"
      }
    }
  },
  "log_level": "INFO"
}

Using the MCP Plugin

Once configured, the MCP plugin will automatically:

  1. Connect to all configured MCP servers
  2. Discover available tools, resources, and prompts
  3. Make these capabilities available to the language model
  4. Handle tool calls and resource requests

The plugin enhances the system prompt with MCP capabilities so the model knows which tools are available. When the model decides to use a tool, the plugin:

  1. Executes the tool with the provided arguments
  2. Returns the results to the model
  3. Allows the model to incorporate the results into its response
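The flow above (advertise tools in the system prompt, execute a requested tool, hand the result back) can be sketched without any MCP server. Everything here is hypothetical and for illustration only: the in-process TOOLS registry, the function names, and the message shapes are assumptions, whereas the real plugin discovers tools from the configured MCP servers.

```python
import os

# Hypothetical in-process tool registry; a real MCP client would discover
# these from the configured servers instead of defining them locally.
TOOLS = {
    "list_files": {
        "description": "List file names in a directory",
        "run": lambda directory: sorted(os.listdir(directory)),
    },
}


def augment_system_prompt(system_prompt, tools):
    """Append a description of each available tool to the system prompt."""
    lines = [system_prompt, "", "Available tools:"]
    lines += [f"- {name}: {tool['description']}" for name, tool in sorted(tools.items())]
    return "\n".join(lines)


def handle_tool_call(call, tools):
    """Execute one tool call and return a result message for the model."""
    tool = tools.get(call["name"])
    if tool is None:
        return {"role": "tool", "name": call["name"], "content": "error: unknown tool"}
    result = tool["run"](**call["arguments"])
    return {"role": "tool", "name": call["name"], "content": str(result)}


print(augment_system_prompt("You are a helpful assistant.", TOOLS))
```

The returned tool message would be appended to the conversation so the model can incorporate the result into its final response.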

Example Queries

Here are some examples of queries that will engage MCP tools:

  • "List all the Python files in my documents directory" (Filesystem)
  • "What are the recent commits in my Git repository?" (Git)
  • "Search for the latest information about renewable energy" (Search)
  • "Query my database for all users who registered this month" (Database)

Troubleshooting

Logs

The MCP plugin logs detailed information to:

~/.optillm/logs/mcp_plugin.log

Check this log file for connection issues, tool execution errors, and other diagnostic information.

Common Issues
  1. Command not found: Make sure the server executable is available in your PATH, or use an absolute path in the configuration.

  2. Connection failed: Verify the server is properly configured and any required API keys are provided.

  3. Method not found: Some servers don't implement all MCP capabilities (tools, resources, prompts). Verify which capabilities the server supports.

  4. Access denied: For filesystem operations, ensure the paths specified in the configuration are accessible to the process.

Implemented techniques

| Approach | Slug | Description |
| --- | --- | --- |
| Cerebras Planning and Optimization | cepo | Combines Best of N, Chain-of-Thought, Self-Reflection, Self-Improvement, and various prompting techniques |
| CoT with Reflection | cot_reflection | Implements chain-of-thought reasoning with <thinking>, <reflection> and <output> sections |
| PlanSearch | plansearch | Implements a search algorithm over candidate plans for solving a problem in natural language |
| ReRead | re2 | Implements rereading to improve reasoning by processing queries twice |
| Self-Consistency | self_consistency | Implements an advanced self-consistency method |
| Z3 Solver | z3 | Utilizes the Z3 theorem prover for logical reasoning |
| R* Algorithm | rstar | Implements the R* algorithm for problem-solving |
| LEAP | leap | Learns task-specific principles from few shot examples |
| Round Trip Optimization | rto | Optimizes responses through a round-trip process |
| Best of N Sampling | bon | Generates multiple responses and selects the best one |
| Mixture of Agents | moa | Combines responses from multiple critiques |
| Monte Carlo Tree Search | mcts | Uses MCTS for decision-making in chat responses |
| PV Game | pvg | Applies a prover-verifier game approach at inference time |
| CoT Decoding | N/A for proxy | Implements chain-of-thought decoding to elicit reasoning without explicit prompting |
| Entropy Decoding | N/A for proxy | Implements adaptive sampling based on the uncertainty of tokens during generation |
| Thinkdeeper | N/A for proxy | Implements the reasoning_effort param from OpenAI for reasoning models like DeepSeek R1 |

Implemented plugins

| Plugin | Slug | Description |
| --- | --- | --- |
| MCP Client | mcp | Implements the model context protocol (MCP) client, enabling you to use any LLM with any MCP Server |
| Router | router | Uses the optillm-modernbert-large model to route requests to different approaches based on the user prompt |
| Chain-of-Code | coc | Implements a chain of code approach that combines CoT with code execution and LLM based code simulation |
| Memory | memory | Implements a short term memory layer, enables you to use unbounded context length with any LLM |
| Privacy | privacy | Anonymize PII data in request and deanonymize it back to original value in response |
| Read URLs | readurls | Reads all URLs found in the request, fetches the content at the URL and adds it to the context |
| Execute Code | executecode | Enables use of code interpreter to execute python code in requests and LLM generated responses |
| JSON | json | Enables structured outputs using the outlines library, supports pydantic types and JSON schema |

Available parameters

optillm supports various command-line arguments for configuration. When using Docker, these can also be set as environment variables prefixed with OPTILLM_.

| Parameter | Description | Default Value |
| --- | --- | --- |
| --approach | Inference approach to use | "auto" |
| --simulations | Number of MCTS simulations | 2 |
| --exploration | Exploration weight for MCTS | 0.2 |
| --depth | Simulation depth for MCTS | 1 |
| --best-of-n | Number of samples for best_of_n approach | 3 |
| --model | OpenAI model to use | "gpt-4o-mini" |
| --base-url | Base URL for OpenAI compatible endpoint | "" |
| --rstar-max-depth | Maximum depth for rStar algorithm | 3 |
| --rstar-num-rollouts | Number of rollouts for rStar algorithm | 5 |
| --rstar-c | Exploration constant for rStar algorithm | 1.4 |
| --n | Number of final responses to be returned | 1 |
| --return-full-response | Return the full response including the CoT with tags | False |
| --port | Specify the port to run the proxy | 8000 |
| --optillm-api-key | Optional API key for client authentication to optillm | "" |
| --cepo_* | See CePO Parameters section below for detailed configuration options | Various |

CePO Parameters

| Parameter | Description | Default Value |
| --- | --- | --- |
| --cepo_bestofn_n | Number of responses to be generated in best of n stage | 3 |
| --cepo_bestofn_temperature | Temperature for verifier in best of n stage | 0.1 |
| --cepo_bestofn_max_tokens | Maximum number of tokens for verifier in best of n stage | 4096 |
| --cepo_bestofn_rating_type | Type of rating in best of n stage ("absolute" or "pairwise") | "absolute" |
| --cepo_planning_n | Number of plans generated in planning stage | 3 |
| --cepo_planning_m | Number of attempts to generate n plans in planning stage | 6 |
| --cepo_planning_temperature_step1 | Temperature for generator in step 1 of planning stage | 0.55 |
| --cepo_planning_temperature_step2 | Temperature for generator in step 2 of planning stage | 0.25 |
| --cepo_planning_temperature_step3 | Temperature for generator in step 3 of planning stage | 0.1 |
| --cepo_planning_temperature_step4 | Temperature for generator in step 4 of planning stage | 0 |
| --cepo_planning_max_tokens_step1 | Maximum number of tokens in step 1 of planning stage | 4096 |
| --cepo_planning_max_tokens_step2 | Maximum number of tokens in step 2 of planning stage | 4096 |
| --cepo_planning_max_tokens_step3 | Maximum number of tokens in step 3 of planning stage | 4096 |
| --cepo_planning_max_tokens_step4 | Maximum number of tokens in step 4 of planning stage | 4096 |
| --cepo_print_output | Whether to print the output of each stage | False |
| --cepo_config_file | Path to CePO configuration file | None |

Running with Docker

optillm can optionally be built and run using Docker and the provided Dockerfile.

Using Docker Compose

  1. Make sure you have Docker and Docker Compose installed on your system.

  2. Either update the environment variables in the docker-compose.yaml file or create a .env file in the project root directory and add any environment variables you want to set. For example, to set the OpenAI API key, add the following line to the .env file:

    OPENAI_API_KEY=your_openai_api_key_here
  3. Run the following command to start optillm:

    docker compose up -d

    This will build the Docker image if it doesn't exist and start the optillm service.

  4. optillm will be available at http://localhost:8000.

When using Docker, you can set these parameters as environment variables. For example, to set the approach and model, you would use:

OPTILLM_APPROACH=mcts
OPTILLM_MODEL=gpt-4

To secure the optillm proxy with an API key, set the OPTILLM_API_KEY environment variable:

OPTILLM_API_KEY=your_secret_api_key

When the API key is set, clients must include it in their requests using the Authorization header:

Authorization: Bearer your_secret_api_key
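For clients that do not use the OpenAI SDK, the header can be attached by hand. The sketch below only builds the request with the standard library and does not send it; build_chat_request is a hypothetical helper, and the URL assumes the proxy is running locally on the default port with the /v1/chat/completions path seen in the proxy logs.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) a chat completions request with a Bearer token."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1", "your_secret_api_key", "gpt-4o-mini", "Hello"
)
print(request.get_header("Authorization"))  # Bearer your_secret_api_key

# To actually send it once the proxy is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

With the OpenAI SDK the same effect is achieved by passing the key as api_key when constructing the client.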

SOTA results on benchmarks with optillm

CePO on math and code benchmarks (Jan 2025)

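| Method | Math-L5 | MMLU-Pro (Math) | GPQA | CRUX | LiveCodeBench (pass@1) | Simple QA |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 70B | 41.6 | 72.9 | 41.7 | 64.2 | 24.5 | 14.7 |
| Llama 3.3 70B | 51.0 | 78.6 | 49.1 | 72.6 | 27.1 | 20.9 |
| Llama 3.1 405B | 49.8 | 79.2 | 50.7 | 73.0 | 31.8 | 13.5 |
| CePO (using Llama 3.3 70B) | 69.6 | 84.8 | 55.5 | 80.1 | 31.9 | 22.6 |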
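To read the table as gains, the short script below subtracts the Llama 3.3 70B row from the CePO row. The numbers are copied from the table above; the variable names are only for this illustration.

```python
BENCHMARKS = [
    "Math-L5", "MMLU-Pro (Math)", "GPQA", "CRUX",
    "LiveCodeBench (pass@1)", "Simple QA",
]
LLAMA_3_3_70B = [51.0, 78.6, 49.1, 72.6, 27.1, 20.9]
LLAMA_3_1_405B = [49.8, 79.2, 50.7, 73.0, 31.8, 13.5]
CEPO = [69.6, 84.8, 55.5, 80.1, 31.9, 22.6]

# Absolute improvement of CePO over its own base model, in points.
gains = {
    name: round(cepo - base, 1)
    for name, base, cepo in zip(BENCHMARKS, LLAMA_3_3_70B, CEPO)
}
for name, gain in gains.items():
    print(f"{name}: +{gain}")

# CePO on the 70B model also scores above the 405B model on every column.
beats_405b = all(cepo > big for cepo, big in zip(CEPO, LLAMA_3_1_405B))
print(beats_405b)  # True
```

The largest gain is on Math-L5 (18.6 points) and the smallest on Simple QA (1.7 points).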

coc-claude-3-5-sonnet-20241022 on AIME 2024 pass@1 (Nov 2024)

| Model | Score |
| --- | --- |
| o1-mini | 56.67 |
| coc-claude-3-5-sonnet-20241022 | 46.67 |
| coc-gemini/gemini-exp-1121 | 46.67 |
| o1-preview | 40.00 |
| gemini-exp-1114 | 36.67 |
| claude-3-5-sonnet-20241022 | 20.00 |
| gemini-1.5-pro-002 | 20.00 |
| gemini-1.5-flash-002 | 16.67 |

readurls&memory-gpt-4o-mini on Google FRAMES Benchmark (Oct 2024)

| Model | Accuracy |
| --- | --- |
| readurls&memory-gpt-4o-mini | 61.29 |
| gpt-4o-mini | 50.61 |
| readurls&memory-Gemma2-9b | 30.1 |
| Gemma2-9b | 5.1 |
| Gemma2-27b | 30.8 |
| Gemini Flash 1.5 | 66.5 |
| Gemini Pro 1.5 | 72.9 |

plansearch-gpt-4o-mini on LiveCodeBench (Sep 2024)

| Model | pass@1 | pass@5 | pass@10 |
| --- | --- | --- | --- |
| plansearch-gpt-4o-mini | 44.03 | 59.31 | 63.5 |
| gpt-4o-mini | 43.9 | 50.61 | 53.25 |
| claude-3.5-sonnet | 51.3 | | |
| gpt-4o-2024-05-13 | 45.2 | | |
| gpt-4-turbo-2024-04-09 | 44.2 | | |

moa-gpt-4o-mini on Arena-Hard-Auto (Aug 2024)

Results showing Mixture of Agents approach using gpt-4o-mini on Arena Hard Auto Benchmark

optillm with Patchwork (July 2024)

Since optillm is a drop-in replacement for the OpenAI API, you can easily integrate it with existing tools and frameworks using the OpenAI client. We used optillm with patchwork, an open-source framework that automates development gruntwork like PR reviews, bug fixing, and security patching using workflows called patchflows. We saw huge performance gains across all the supported patchflows, as shown below, when using the mixture of agents approach (moa).

Results showing optillm mixture of agents approach used with patchflows

References

Citation

If you use this library in your research, please cite:

@software{optillm,
  title = {Optillm: Optimizing inference proxy for LLMs},
  author = {Asankhaya Sharma},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/codelion/optillm}
}
