# OpenAI Compatible Server
`llama-cpp-python` offers an OpenAI API compatible web server. This web server can be used to serve local models and easily connect them to existing clients.
## Setup
### Installation
The server can be installed by running the following command:
```bash
pip install llama-cpp-python[server]
```
### Running the server
The server can then be started by running the following command:
```bash
python3 -m llama_cpp.server --model <model_path>
```
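Once the server is running, any OpenAI-compatible client can connect to it. As a minimal sketch, assuming the server is listening on the default `localhost:8000` and no `--api_key` was set (so any placeholder key is accepted), you can use the official `openai` Python client:

```python
from openai import OpenAI

# Point the official OpenAI client at the local server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-no-key-required",  # placeholder; only checked if --api_key is set
)

response = client.chat.completions.create(
    model="local-model",  # placeholder name; with a single loaded model the alias is not critical
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(response.choices[0].message.content)
```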
### Server options
For a full list of options, run:
```bash
python3 -m llama_cpp.server --help
```
NOTE: All server options are also available as environment variables. For example, `--model` can be set by setting the `MODEL` environment variable.
Check out the server options reference below for more information on the available options. CLI arguments and environment variables are available for all of the fields defined in `ServerSettings` and `ModelSettings`.

Additionally, the server supports configuration via a config file; check out the configuration section below for more information and examples.
## Guides
### Code Completion
`llama-cpp-python` supports code completion via GitHub Copilot.
NOTE: Without GPU acceleration this is unlikely to be fast enough to be usable.
You'll first need to download one of the available code completion models in GGUF format:
Then you'll need to run the OpenAI compatible web server with a substantially increased context size to accommodate GitHub Copilot requests:
```bash
python3 -m llama_cpp.server --model <model_path> --n_ctx 16192
```
Then just update your settings in `.vscode/settings.json` to point to your code completion server:
```json
{
    // ...
    "github.copilot.advanced": {
        "debug.testOverrideProxyUrl": "http://<host>:<port>",
        "debug.overrideProxyUrl": "http://<host>:<port>"
    }
    // ...
}
```
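To sanity-check the code completion server independently of Copilot, you can also call the completions endpoint directly with the OpenAI Python client. This is a sketch only; the host, port, and model name below are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

# Ask the code model to continue a code fragment via the (legacy) completions endpoint.
completion = client.completions.create(
    model="copilot-codex",  # placeholder; with a single loaded model the name is not critical
    prompt="def fibonacci(n):\n    ",
    max_tokens=64,
    stop=["\n\n"],
)
print(completion.choices[0].text)
```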
### Function Calling
`llama-cpp-python` supports structured function calling based on a JSON schema. Function calling is completely compatible with the OpenAI function calling API and can be used by connecting with the official OpenAI Python client.
You'll first need to download one of the available function calling models in GGUF format:
Then when you run the server you'll need to also specify either the `functionary-v1` or `functionary-v2` `chat_format`.
Note that since functionary requires a HF Tokenizer due to discrepancies between llama.cpp and HuggingFace's tokenizers as mentioned here, you will need to pass in the path to the tokenizer too. The tokenizer files are already included in the respective HF repositories hosting the gguf files.
```bash
python3 -m llama_cpp.server --model <model_path_to_functionary_v2_model> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <model_path_to_functionary_v2_tokenizer>
```
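Once the server is running with a functionary model, requests can use the OpenAI tools API as usual. The sketch below assumes the server above is reachable at `http://<host>:<port>`; the `get_current_weather` schema is a hypothetical example:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

# A hypothetical tool definition using the OpenAI function calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="functionary",  # placeholder; the server uses the loaded functionary model
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```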
Check out this example notebook for a walkthrough of some interesting use cases for function calling.
### Multimodal Models
`llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to read information from both text and images.
You'll first need to download one of the available multi-modal models in GGUF format:
Then when you run the server you'll need to also specify the path to the clip model used for image embedding and the `llava-1-5` `chat_format`:
```bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```
Then you can just use the OpenAI API as normal:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "<image_url>"},
                },
                {"type": "text", "text": "What does the image say"},
            ],
        }
    ],
)
print(response)
```
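For local images, one option is to inline the file as a base64 data URL in `image_url`, following the OpenAI vision API convention. Treat this as a sketch; whether data URLs are accepted depends on the chat handler in use:

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")

# Encode a local image as a data URL (assumes the handler accepts base64 data URIs).
with open("image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_data}"},
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```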
## Configuration and Multi-Model Support
The server supports configuration via a JSON config file that can be passed using the `--config_file` parameter or the `CONFIG_FILE` environment variable.
```bash
python3 -m llama_cpp.server --config_file <config_file>
```
Config files support all of the server and model options supported by the CLI and environment variables; however, instead of only a single model, the config file can specify multiple models.
The server supports routing requests to multiple models based on the `model` parameter in the request, which is matched against the `model_alias` in the config file.
At the moment only a single model is loaded into memory at a time; the server will automatically load and unload models as needed.
{"host":"0.0.0.0","port":8080,"models":[{"model":"models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf","model_alias":"gpt-3.5-turbo","chat_format":"chatml","n_gpu_layers":-1,"offload_kqv":true,"n_threads":12,"n_batch":512,"n_ctx":2048},{"model":"models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf","model_alias":"gpt-4","chat_format":"chatml","n_gpu_layers":-1,"offload_kqv":true,"n_threads":12,"n_batch":512,"n_ctx":2048},{"model":"models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf","model_alias":"gpt-4-vision-preview","chat_format":"llava-1-5","clip_model_path":"models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf","n_gpu_layers":-1,"offload_kqv":true,"n_threads":12,"n_batch":512,"n_ctx":2048},{"model":"models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf","model_alias":"text-davinci-003","n_gpu_layers":-1,"offload_kqv":true,"n_threads":12,"n_batch":512,"n_ctx":2048},{"model":"models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf","model_alias":"copilot-codex","n_gpu_layers":-1,"offload_kqv":true,"n_threads":12,"n_batch":1024,"n_ctx":9216}]}
The config file format is defined by the `ConfigFileSettings` class.
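With a config like the example above loaded, the `model` field of each request selects which entry serves it. A minimal sketch, assuming the server from that config is listening on port 8080:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-xxx")

# Routed to the entry with model_alias "gpt-3.5-turbo" (OpenHermes in the example config).
chat = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(chat.choices[0].message.content)

# Routed to the entry with model_alias "text-davinci-003" via the completions endpoint.
completion = client.completions.create(
    model="text-davinci-003",
    prompt="Once upon a time",
    max_tokens=32,
)
print(completion.choices[0].text)
```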
## Server Options Reference
### llama_cpp.server.settings.ConfigFileSettings

Bases: `ServerSettings`
Configuration file format settings.
Source code in `llama_cpp/server/settings.py`:

```python
class ConfigFileSettings(ServerSettings):
    """Configuration file format settings."""

    models: List[ModelSettings] = Field(default=[], description="Model configs")
```
Attributes (each a class attribute and instance attribute):

- `models = Field(default=[], description='Model configs')`
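Since `ConfigFileSettings` is a pydantic model, one way to produce a valid config file is to build it programmatically and serialize it. A sketch, assuming pydantic v2 and hypothetical model paths:

```python
from llama_cpp.server.settings import ConfigFileSettings, ModelSettings

# Build a config with one aliased model (the path below is hypothetical).
settings = ConfigFileSettings(
    host="0.0.0.0",
    port=8080,
    models=[
        ModelSettings(
            model="models/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            model_alias="gpt-3.5-turbo",
            chat_format="chatml",
            n_ctx=2048,
        )
    ],
)

# Write the config in the format expected by --config_file / CONFIG_FILE.
with open("config.json", "w") as f:
    f.write(settings.model_dump_json(indent=2))
```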
### llama_cpp.server.settings.ServerSettings

Bases: `BaseSettings`
Server settings used to configure the FastAPI and Uvicorn server.
Source code in `llama_cpp/server/settings.py`:

```python
class ServerSettings(BaseSettings):
    """Server settings used to configure the FastAPI and Uvicorn server."""

    # Uvicorn Settings
    host: str = Field(default="localhost", description="Listen address")
    port: int = Field(default=8000, description="Listen port")
    ssl_keyfile: Optional[str] = Field(default=None, description="SSL key file for HTTPS")
    ssl_certfile: Optional[str] = Field(default=None, description="SSL certificate file for HTTPS")
    # FastAPI Settings
    api_key: Optional[str] = Field(
        default=None,
        description="API key for authentication. If set all requests need to be authenticated.",
    )
    interrupt_requests: bool = Field(
        default=True,
        description="Whether to interrupt requests when a new request is received.",
    )
    disable_ping_events: bool = Field(
        default=False,
        description="Disable EventSource pings (may be needed for some clients).",
    )
    root_path: str = Field(
        default="",
        description="The root path for the server. Useful when running behind a reverse proxy.",
    )
```
Attributes (each a class attribute and instance attribute):

- `host = Field(default='localhost', description='Listen address')`
- `port = Field(default=8000, description='Listen port')`
- `ssl_keyfile = Field(default=None, description='SSL key file for HTTPS')`
- `ssl_certfile = Field(default=None, description='SSL certificate file for HTTPS')`
- `api_key = Field(default=None, description='API key for authentication. If set all requests need to be authenticated.')`
- `interrupt_requests = Field(default=True, description='Whether to interrupt requests when a new request is received.')`
- `disable_ping_events = Field(default=False, description='Disable EventSource pings (may be needed for some clients).')`
- `root_path = Field(default='', description='The root path for the server. Useful when running behind a reverse proxy.')`
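Because `ServerSettings` is a pydantic `BaseSettings` class, its fields can also be populated from environment variables, which is what backs the server's environment variable support. A sketch, assuming the default behaviour of matching field names case-insensitively with no prefix:

```python
import os

from llama_cpp.server.settings import ServerSettings

# Environment variables fill in unset fields (assumption: no env prefix is configured).
os.environ["PORT"] = "8080"

settings = ServerSettings()
print(settings.host, settings.port)  # expected: localhost 8080
```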
### llama_cpp.server.settings.ModelSettings

Bases: `BaseSettings`
Model settings used to load a Llama model.
Source code in `llama_cpp/server/settings.py`:

```python
class ModelSettings(BaseSettings):
    """Model settings used to load a Llama model."""

    model: str = Field(description="The path to the model to use for generating completions.")
    model_alias: Optional[str] = Field(default=None, description="The alias of the model to use for generating completions.")
    # Model Params
    n_gpu_layers: int = Field(default=0, ge=-1, description="The number of layers to put on the GPU. The rest will be on the CPU. Set -1 to move all to GPU.")
    split_mode: int = Field(default=llama_cpp.LLAMA_SPLIT_MODE_LAYER, description="The split mode to use.")
    main_gpu: int = Field(default=0, ge=0, description="Main GPU to use.")
    tensor_split: Optional[List[float]] = Field(default=None, description="Split layers across multiple GPUs in proportion.")
    vocab_only: bool = Field(default=False, description="Whether to only return the vocabulary.")
    use_mmap: bool = Field(default=llama_cpp.llama_supports_mmap(), description="Use mmap.")
    use_mlock: bool = Field(default=llama_cpp.llama_supports_mlock(), description="Use mlock.")
    kv_overrides: Optional[List[str]] = Field(default=None, description="List of model kv overrides in the format key=type:value where type is one of (bool, int, float). Valid true values are (true, TRUE, 1), otherwise false.")
    rpc_servers: Optional[str] = Field(default=None, description="comma seperated list of rpc servers for offloading")
    # Context Params
    seed: int = Field(default=llama_cpp.LLAMA_DEFAULT_SEED, description="Random seed. -1 for random.")
    n_ctx: int = Field(default=2048, ge=0, description="The context size.")
    n_batch: int = Field(default=512, ge=1, description="The batch size to use per eval.")
    n_ubatch: int = Field(default=512, ge=1, description="The physical batch size used by llama.cpp")
    n_threads: int = Field(default=max(multiprocessing.cpu_count() // 2, 1), ge=1, description="The number of threads to use. Use -1 for max cpu threads")
    n_threads_batch: int = Field(default=max(multiprocessing.cpu_count(), 1), ge=0, description="The number of threads to use when batch processing. Use -1 for max cpu threads")
    rope_scaling_type: int = Field(default=llama_cpp.LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED)
    rope_freq_base: float = Field(default=0.0, description="RoPE base frequency")
    rope_freq_scale: float = Field(default=0.0, description="RoPE frequency scaling factor")
    yarn_ext_factor: float = Field(default=-1.0)
    yarn_attn_factor: float = Field(default=1.0)
    yarn_beta_fast: float = Field(default=32.0)
    yarn_beta_slow: float = Field(default=1.0)
    yarn_orig_ctx: int = Field(default=0)
    mul_mat_q: bool = Field(default=True, description="if true, use experimental mul_mat_q kernels")
    logits_all: bool = Field(default=True, description="Whether to return logits.")
    embedding: bool = Field(default=False, description="Whether to use embeddings.")
    offload_kqv: bool = Field(default=True, description="Whether to offload kqv to the GPU.")
    flash_attn: bool = Field(default=False, description="Whether to use flash attention.")
    # Sampling Params
    last_n_tokens_size: int = Field(default=64, ge=0, description="Last n tokens to keep for repeat penalty calculation.")
    # LoRA Params
    lora_base: Optional[str] = Field(default=None, description="Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.")
    lora_path: Optional[str] = Field(default=None, description="Path to a LoRA file to apply to the model.")
    # Backend Params
    numa: Union[bool, int] = Field(default=False, description="Enable NUMA support.")
    # Chat Format Params
    chat_format: Optional[str] = Field(default=None, description="Chat format to use.")
    clip_model_path: Optional[str] = Field(default=None, description="Path to a CLIP model to use for multi-modal chat completion.")
    # Cache Params
    cache: bool = Field(default=False, description="Use a cache to reduce processing times for evaluated prompts.")
    cache_type: Literal["ram", "disk"] = Field(default="ram", description="The type of cache to use. Only used if cache is True.")
    cache_size: int = Field(default=2 << 30, description="The size of the cache in bytes. Only used if cache is True.")
    # Tokenizer Options
    hf_tokenizer_config_path: Optional[str] = Field(default=None, description="The path to a HuggingFace tokenizer_config.json file.")
    hf_pretrained_model_name_or_path: Optional[str] = Field(default=None, description="The model name or path to a pretrained HuggingFace tokenizer model. Same as you would pass to AutoTokenizer.from_pretrained().")
    # Loading from HuggingFace Model Hub
    hf_model_repo_id: Optional[str] = Field(default=None, description="The model repo id to use for the HuggingFace tokenizer model.")
    # Speculative Decoding
    draft_model: Optional[str] = Field(default=None, description="Method to use for speculative decoding. One of (prompt-lookup-decoding).")
    draft_model_num_pred_tokens: int = Field(default=10, description="Number of tokens to predict using the draft model.")
    # KV Cache Quantization
    type_k: Optional[int] = Field(default=None, description="Type of the key cache quantization.")
    type_v: Optional[int] = Field(default=None, description="Type of the value cache quantization.")
    # Misc
    verbose: bool = Field(default=True, description="Whether to print debug information.")

    @model_validator(mode="before")  # pre=True to ensure this runs before any other validation
    def set_dynamic_defaults(self) -> Self:
        # If n_threads or n_threads_batch is -1, set it to multiprocessing.cpu_count()
        cpu_count = multiprocessing.cpu_count()
        values = cast(Dict[str, int], self)
        if values.get("n_threads", 0) == -1:
            values["n_threads"] = cpu_count
        if values.get("n_threads_batch", 0) == -1:
            values["n_threads_batch"] = cpu_count
        return self
```
Attributes (each a class attribute and instance attribute):

- `model = Field(description='The path to the model to use for generating completions.')`
- `model_alias = Field(default=None, description='The alias of the model to use for generating completions.')`
- `n_gpu_layers = Field(default=0, ge=-1, description='The number of layers to put on the GPU. The rest will be on the CPU. Set -1 to move all to GPU.')`
- `split_mode = Field(default=llama_cpp.LLAMA_SPLIT_MODE_LAYER, description='The split mode to use.')`
- `main_gpu = Field(default=0, ge=0, description='Main GPU to use.')`
- `tensor_split = Field(default=None, description='Split layers across multiple GPUs in proportion.')`
- `vocab_only = Field(default=False, description='Whether to only return the vocabulary.')`
- `use_mmap = Field(default=llama_cpp.llama_supports_mmap(), description='Use mmap.')`
- `use_mlock = Field(default=llama_cpp.llama_supports_mlock(), description='Use mlock.')`
- `kv_overrides = Field(default=None, description='List of model kv overrides in the format key=type:value where type is one of (bool, int, float). Valid true values are (true, TRUE, 1), otherwise false.')`
- `rpc_servers = Field(default=None, description='comma seperated list of rpc servers for offloading')`
- `seed = Field(default=llama_cpp.LLAMA_DEFAULT_SEED, description='Random seed. -1 for random.')`
- `n_ctx = Field(default=2048, ge=0, description='The context size.')`
- `n_batch = Field(default=512, ge=1, description='The batch size to use per eval.')`
- `n_ubatch = Field(default=512, ge=1, description='The physical batch size used by llama.cpp')`
- `n_threads = Field(default=max(multiprocessing.cpu_count() // 2, 1), ge=1, description='The number of threads to use. Use -1 for max cpu threads')`
- `n_threads_batch = Field(default=max(multiprocessing.cpu_count(), 1), ge=0, description='The number of threads to use when batch processing. Use -1 for max cpu threads')`
- `rope_scaling_type = Field(default=llama_cpp.LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED)`
- `rope_freq_base = Field(default=0.0, description='RoPE base frequency')`
- `rope_freq_scale = Field(default=0.0, description='RoPE frequency scaling factor')`
- `yarn_ext_factor = Field(default=-1.0)`
- `yarn_attn_factor = Field(default=1.0)`
- `yarn_beta_fast = Field(default=32.0)`
- `yarn_beta_slow = Field(default=1.0)`
- `yarn_orig_ctx = Field(default=0)`
- `mul_mat_q = Field(default=True, description='if true, use experimental mul_mat_q kernels')`
- `logits_all = Field(default=True, description='Whether to return logits.')`
- `embedding = Field(default=False, description='Whether to use embeddings.')`
- `offload_kqv = Field(default=True, description='Whether to offload kqv to the GPU.')`
- `flash_attn = Field(default=False, description='Whether to use flash attention.')`
- `last_n_tokens_size = Field(default=64, ge=0, description='Last n tokens to keep for repeat penalty calculation.')`
- `lora_base = Field(default=None, description='Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.')`
- `lora_path = Field(default=None, description='Path to a LoRA file to apply to the model.')`
- `numa = Field(default=False, description='Enable NUMA support.')`
- `chat_format = Field(default=None, description='Chat format to use.')`
- `clip_model_path = Field(default=None, description='Path to a CLIP model to use for multi-modal chat completion.')`
- `cache = Field(default=False, description='Use a cache to reduce processing times for evaluated prompts.')`
- `cache_type = Field(default='ram', description='The type of cache to use. Only used if cache is True.')`
- `cache_size = Field(default=2 << 30, description='The size of the cache in bytes. Only used if cache is True.')`
- `hf_tokenizer_config_path = Field(default=None, description='The path to a HuggingFace tokenizer_config.json file.')`
- `hf_pretrained_model_name_or_path = Field(default=None, description='The model name or path to a pretrained HuggingFace tokenizer model. Same as you would pass to AutoTokenizer.from_pretrained().')`
- `hf_model_repo_id = Field(default=None, description='The model repo id to use for the HuggingFace tokenizer model.')`
- `draft_model = Field(default=None, description='Method to use for speculative decoding. One of (prompt-lookup-decoding).')`
- `draft_model_num_pred_tokens = Field(default=10, description='Number of tokens to predict using the draft model.')`
- `type_k = Field(default=None, description='Type of the key cache quantization.')`
- `type_v = Field(default=None, description='Type of the value cache quantization.')`
- `verbose = Field(default=True, description='Whether to print debug information.')`
#### set_dynamic_defaults()
Source code in `llama_cpp/server/settings.py`:

```python
@model_validator(mode="before")  # pre=True to ensure this runs before any other validation
def set_dynamic_defaults(self) -> Self:
    # If n_threads or n_threads_batch is -1, set it to multiprocessing.cpu_count()
    cpu_count = multiprocessing.cpu_count()
    values = cast(Dict[str, int], self)
    if values.get("n_threads", 0) == -1:
        values["n_threads"] = cpu_count
    if values.get("n_threads_batch", 0) == -1:
        values["n_threads_batch"] = cpu_count
    return self
```
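For example, passing `-1` for the thread counts resolves them to the machine's CPU count before the `ge=` constraints are checked. A small sketch with a hypothetical model path:

```python
import multiprocessing

from llama_cpp.server.settings import ModelSettings

# -1 is rewritten to multiprocessing.cpu_count() by set_dynamic_defaults.
settings = ModelSettings(model="models/example.gguf", n_threads=-1, n_threads_batch=-1)

assert settings.n_threads == multiprocessing.cpu_count()
assert settings.n_threads_batch == multiprocessing.cpu_count()
```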