abetlen/llama-cpp-python
Python Bindings for llama.cpp
Simple Python bindings for @ggerganov's llama.cpp library. This package provides:
- Low-level access to C API via ctypes interface.
- High-level Python API for text completion
- OpenAI-like API
- LangChain compatibility
- LlamaIndex compatibility
- OpenAI compatible web server
Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest.
Requirements:
- Python 3.8+
- C compiler
- Linux: gcc or clang
- Windows: Visual Studio or MinGW
- MacOS: Xcode
To install the package, run:
pip install llama-cpp-python
This will also build llama.cpp from source and install it alongside this Python package.
If this fails, add --verbose to the pip install command to see the full cmake build log.
Pre-built Wheel (New)
It is also possible to install a pre-built wheel with basic CPU support.
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
llama.cpp supports a number of hardware acceleration backends to speed up inference as well as backend specific options. See the llama.cpp README for a full list.
All llama.cpp cmake build options can be set via the CMAKE_ARGS environment variable or via the --config-settings / -C cli flag during installation.
Environment Variables
# Linux and Mac
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" \
  pip install llama-cpp-python
# Windows
$env:CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
pip install llama-cpp-python
CLI / requirements.txt
They can also be set via the pip install -C / --config-settings command and saved to a requirements.txt file:
pip install --upgrade pip  # ensure pip is up to date
pip install llama-cpp-python \
  -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
# requirements.txt
llama-cpp-python -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
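The two forms above differ only in the separator: the CMAKE_ARGS environment variable takes space-separated -D flags, while cmake.args takes them separated by semicolons. As a minimal sketch, the helpers below (hypothetical names, not part of this package) render one set of options both ways so the two stay in sync:

```python
import shlex


def cmake_args_env(options: dict) -> str:
    """Space-separated -D flags, the form used in the CMAKE_ARGS environment variable."""
    return " ".join(f"-D{key}={value}" for key, value in options.items())


def cmake_args_config_setting(options: dict) -> str:
    """Semicolon-separated -D flags, the form used with `pip install -C cmake.args=...`."""
    return ";".join(f"-D{key}={value}" for key, value in options.items())


# Options taken from the OpenBLAS example in this README.
options = {"GGML_BLAS": "ON", "GGML_BLAS_VENDOR": "OpenBLAS"}
env_value = cmake_args_env(options)
cli_value = cmake_args_config_setting(options)

# shlex.quote adds POSIX shell quoting; PowerShell users would quote differently.
print(f"CMAKE_ARGS={shlex.quote(env_value)} pip install llama-cpp-python")
print(f"pip install llama-cpp-python -C cmake.args={shlex.quote(cli_value)}")
```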
Below are some common backends, their build commands and any additional environment variables required.
OpenBLAS (CPU)
To install with OpenBLAS, set the GGML_BLAS and GGML_BLAS_VENDOR cmake options via the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
CUDA
To install with CUDA support, set the GGML_CUDA=on cmake option via the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
Pre-built Wheel (New)
It is also possible to install a pre-built wheel with CUDA support, as long as your system meets the following requirements:
- CUDA Version is 12.1, 12.2, 12.3, 12.4 or 12.5
- Python Version is 3.10, 3.11 or 3.12
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
Where <cuda-version> is one of the following:
- cu121: CUDA 12.1
- cu122: CUDA 12.2
- cu123: CUDA 12.3
- cu124: CUDA 12.4
- cu125: CUDA 12.5
For example, to install the CUDA 12.1 wheel:
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
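The mapping from a CUDA version to its wheel index is mechanical, so it can be scripted. The sketch below uses a hypothetical helper (cuda_wheel_index is not part of this package) and assumes only the versions and URL pattern listed above:

```python
WHEEL_INDEX = "https://abetlen.github.io/llama-cpp-python/whl"
SUPPORTED_CUDA = ("12.1", "12.2", "12.3", "12.4", "12.5")


def cuda_wheel_index(cuda_version: str) -> str:
    """Return the --extra-index-url for a CUDA version such as "12.1"."""
    if cuda_version not in SUPPORTED_CUDA:
        raise ValueError(
            f"no pre-built wheel listed for CUDA {cuda_version}; "
            f"expected one of {', '.join(SUPPORTED_CUDA)}"
        )
    # "12.1" -> "cu121"
    return f"{WHEEL_INDEX}/cu{cuda_version.replace('.', '')}"


print(f"pip install llama-cpp-python --extra-index-url {cuda_wheel_index('12.1')}")
```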
Metal
To install with Metal (MPS), set the GGML_METAL=on cmake option via the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
Pre-built Wheel (New)
It is also possible to install a pre-built wheel with Metal support, as long as your system meets the following requirements:
- MacOS Version is 11.0 or later
- Python Version is 3.10, 3.11 or 3.12
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
hipBLAS (ROCm)
To install with hipBLAS / ROCm support for AMD cards, set the GGML_HIPBLAS=on cmake option via the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python
Vulkan
To install with Vulkan support, set the GGML_VULKAN=on cmake option via the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
SYCL
To install with SYCL support, set the GGML_SYCL=on cmake option via the CMAKE_ARGS environment variable before installing:
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
RPC
To install with RPC support, set the GGML_RPC=on cmake option via the CMAKE_ARGS environment variable before installing:
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_RPC=on" pip install llama-cpp-python
Error: Can't find 'nmake' or 'CMAKE_C_COMPILER'
If you run into issues where it complains it can't find 'nmake' '?' or CMAKE_C_COMPILER, you can extract w64devkit as mentioned in the llama.cpp repo and add those manually to CMAKE_ARGS before running pip install:
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DGGML_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
See the above instructions and set CMAKE_ARGS to the BLAS backend you want to use.
Detailed MacOS Metal GPU install documentation is available at docs/install/macos.md
M1 Mac Performance Issue
Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
Otherwise, the installation will build the x86 version of llama.cpp, which will be 10x slower on an Apple Silicon (M1) Mac.
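One quick way to see which architecture the current interpreter targets is the standard library's platform.machine(), which typically reports arm64 for a native Apple Silicon build and x86_64 for an Intel build or one running under Rosetta. This is a rough check using only the standard library; the helper name is hypothetical:

```python
import platform
import sys


def describe_python_architecture() -> str:
    """Summarise the interpreter's architecture and flag a likely x86 build on macOS."""
    machine = platform.machine()
    if sys.platform == "darwin" and machine != "arm64":
        return (
            f"{machine}: if this is an Apple Silicon Mac, "
            "this Python would build the slower x86 version of llama.cpp"
        )
    return f"{machine}: no architecture mismatch detected by this check"


print(describe_python_architecture())
```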
M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`
Try installing with
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
To upgrade and rebuild llama-cpp-python, add the --upgrade --force-reinstall --no-cache-dir flags to the pip install command to ensure the package is rebuilt from source.
The high-level API provides a simple managed interface through the Llama class.
Below is a short example demonstrating how to use the high-level API for basic text completion:
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/llama-model.gguf",
    # n_gpu_layers=-1, # Uncomment to use GPU acceleration
    # seed=1337, # Uncomment to set a specific seed
    # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
    "Q: Name the planets in the solar system? A: ", # Prompt
    max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
By default llama-cpp-python generates completions in an OpenAI compatible format:
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
Text completion is available through the __call__ and create_completion methods of the Llama class.
You can download Llama models in gguf format directly from Hugging Face using the from_pretrained method. You'll need to install the huggingface-hub package to use this feature (pip install huggingface-hub).
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)
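Because the result is a plain dict in the format shown above, the generated text and token counts can be read with ordinary indexing. The sketch below works on a literal copy of that sample output rather than a live model; the "length" value for finish_reason follows the OpenAI convention for a completion cut off by max_tokens and is an assumption here:

```python
# Literal copy of the sample completion shown in this README.
output = {
    "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "object": "text_completion",
    "created": 1679561337,
    "model": "./models/7B/llama-model.gguf",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, "
                    "Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}

first_choice = output["choices"][0]
text = first_choice["text"]  # includes the prompt because echo=True was used
was_truncated = first_choice["finish_reason"] == "length"
usage = output["usage"]

print(text)
print(f"truncated={was_truncated}, total_tokens={usage['total_tokens']}")
```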
By default from_pretrained will download the model to the huggingface cache directory; you can then manage installed model files with the huggingface-cli tool.
The high-level API also provides a simple interface for chat completion.
Chat completion requires that the model knows how to format the messages into a single prompt. The Llama class does this using pre-registered chat formats (i.e. chatml, llama-2, gemma, etc.) or by providing a custom chat handler object.
The model will format the messages into a single prompt using the following order of precedence:
- Use the chat_handler if provided
- Use the chat_format if provided
- Use the tokenizer.chat_template from the gguf model's metadata (should work for most new models, older models may not have this)
- else, fallback to the llama-2 chat format
Set verbose=True to see the selected chat format.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/llama-2/llama-model.gguf",
    chat_format="llama-2"
)
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": "Describe this image in detail please."
        }
    ]
)
Chat completion is available through the create_chat_completion method of the Llama class.
For OpenAI API v1 compatibility, you can use the create_chat_completion_openai_v1 method, which will return pydantic models instead of dicts.
To constrain chat responses to only valid JSON or a specific JSON Schema use the response_format argument in create_chat_completion.
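The order of precedence for choosing a chat format, described earlier in this section, can be sketched as a small function. This is only an illustration of the documented rules, not the library's internal code; select_chat_format and its parameters are hypothetical names:

```python
from typing import Optional


def select_chat_format(
    chat_handler: Optional[object] = None,
    chat_format: Optional[str] = None,
    gguf_metadata: Optional[dict] = None,
) -> str:
    """Return a label for the formatting source chosen by the documented precedence."""
    if chat_handler is not None:
        return "chat_handler"  # 1. an explicit handler object wins
    if chat_format is not None:
        return chat_format  # 2. then a named, pre-registered format
    if gguf_metadata and "tokenizer.chat_template" in gguf_metadata:
        return "tokenizer.chat_template"  # 3. then the template stored in the model file
    return "llama-2"  # 4. final fallback


print(select_chat_format(chat_format="chatml"))
print(select_chat_format(gguf_metadata={"tokenizer.chat_template": "..."}))
print(select_chat_format())
```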
The following example will constrain the response to valid JSON strings only.
from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)
To constrain the response further to a specific JSON Schema add the schema to the schema property of the response_format argument.
from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)
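The constrained reply still arrives as a string, so the caller decodes it before use. The sketch below assumes the OpenAI-style response shape (choices[0].message.content) and uses a hand-written, hypothetical response dict instead of a live model; parse_json_reply is an illustrative helper, not part of this package:

```python
import json

# Hypothetical response for illustration only; the team name is a placeholder.
response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": '{"team_name": "Example Team"}'},
            "finish_reason": "stop",
        }
    ]
}


def parse_json_reply(response: dict, required: list) -> dict:
    """Decode the assistant's JSON content and check that required keys are present."""
    content = response["choices"][0]["message"]["content"]
    data = json.loads(content)  # raises json.JSONDecodeError on invalid JSON
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data


reply = parse_json_reply(response, required=["team_name"])
print(reply["team_name"])
```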
The high-level API supports OpenAI compatible function and tool calling. This is possible through the functionary pre-trained models chat format or through the generic chatml-function-calling chat format.
from llama_cpp import Llama

llm = Llama(model_path="path/to/chatml/llama-model.gguf", chat_format="chatml-function-calling")
llm.create_chat_completion(
    messages = [
        {
            "role": "system",
            "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"
        },
        {
            "role": "user",
            "content": "Extract Jason is 25 years old"
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "UserDetail",
            "parameters": {
                "type": "object",
                "title": "UserDetail",
                "properties": {
                    "name": {
                        "title": "Name",
                        "type": "string"
                    },
                    "age": {
                        "title": "Age",
                        "type": "integer"
                    }
                },
                "required": ["name", "age"]
            }
        }
    }],
    tool_choice={
        "type": "function",
        "function": {
            "name": "UserDetail"
        }
    }
)
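When the model chooses a tool, the arguments need to be decoded before your own function can be called. The sketch below assumes the OpenAI-style layout, in which message.tool_calls holds entries whose function.arguments is a JSON string; the response dict is hand-written for illustration and extract_tool_calls is a hypothetical helper:

```python
import json

# Hypothetical response, shaped like an OpenAI-style tool call result.
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "UserDetail",
                            "arguments": '{"name": "Jason", "age": 25}',
                        },
                    }
                ],
            },
            "finish_reason": "tool_calls",
        }
    ]
}


def extract_tool_calls(response: dict) -> list:
    """Return (function_name, decoded_arguments) pairs from the first choice."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


for name, arguments in extract_tool_calls(response):
    print(name, arguments)
```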
Functionary v2
The various gguf-converted files for this set of models can be found here. Functionary is able to intelligently call functions and also analyze any provided function outputs to generate coherent responses. All v2 models of functionary support parallel function calling. You can provide either functionary-v1 or functionary-v2 for the chat_format when initializing the Llama class.
Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide an HF Tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in the Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files.
from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

llm = Llama.from_pretrained(
    repo_id="meetkai/functionary-small-v2.2-GGUF",
    filename="functionary-small-v2.2.q4_0.gguf",
    chat_format="functionary-v2",
    tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-small-v2.2-GGUF")
)
NOTE: There is no need to provide the default system messages used in Functionary as they are added automatically in the Functionary chat handler. Thus, the messages should contain just the chat messages and/or system messages that provide additional context for the model (e.g.: datetime, etc.).
llama-cpp-python supports multi-modal models such as llava1.5 which allow the language model to read information from both text and images.
Below are the supported multi-modal models and their respective chat handlers (Python API) and chat formats (Server API).
Model | LlamaChatHandler | chat_format |
---|---|---|
llava-v1.5-7b | Llava15ChatHandler | llava-1-5 |
llava-v1.5-13b | Llava15ChatHandler | llava-1-5 |
llava-v1.6-34b | Llava16ChatHandler | llava-1-6 |
moondream2 | MoondreamChatHandler | moondream2 |
nanollava | NanollavaChatHandler | nanollava |
llama-3-vision-alpha | Llama3VisionAlphaChatHandler | llama-3-vision-alpha |
minicpm-v-2.6 | MiniCPMv26ChatHandler | minicpm-v-2.6 |
qwen2.5-vl | Qwen25VLChatHandler | qwen2.5-vl |
You'll need to use a custom chat handler to load the clip model and process the chat messages and images.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
            ]
        }
    ]
)
You can also pull the model from the Hugging Face Hub using the from_pretrained method.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*mmproj*",
)
llm = Llama.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*text-model*",
    chat_handler=chat_handler,
    n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
response = llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
            ]
        }
    ]
)
# Chat completions return the reply under "message", not "text"
print(response["choices"][0]["message"]["content"])
Note: Multi-modal models also support tool calling and JSON mode.
Loading a Local Image
Images can be passed as base64 encoded data URIs. The following example demonstrates how to do this.
import base64

def image_to_base64_data_uri(file_path):
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode('utf-8')
        return f"data:image/png;base64,{base64_data}"

# Replace 'file_path.png' with the actual path to your PNG file
file_path = 'file_path.png'
data_uri = image_to_base64_data_uri(file_path)

messages = [
    {"role": "system", "content": "You are an assistant who perfectly describes images."},
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": "Describe this image in detail please."}
        ]
    }
]
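The helper above hard-codes image/png in the data URI. For other image types, one option is to guess the MIME type from the file extension with the standard library. This is a sketch of a variant, not part of this package; file_to_data_uri is a hypothetical name and the demo writes a throwaway file rather than a real image:

```python
import base64
import mimetypes
import os
import tempfile


def file_to_data_uri(file_path: str, default_mime: str = "application/octet-stream") -> str:
    """Encode a file as a base64 data URI, guessing the MIME type from its extension."""
    mime_type, _ = mimetypes.guess_type(file_path)
    with open(file_path, "rb") as file:
        encoded = base64.b64encode(file.read()).decode("ascii")
    return f"data:{mime_type or default_mime};base64,{encoded}"


# Demo with placeholder bytes; a real call would point at an actual image file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "example.jpg")
    with open(demo_path, "wb") as file:
        file.write(b"not a real image")
    demo_uri = file_to_data_uri(demo_path)

print(demo_uri.split(",", 1)[0])
```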
llama-cpp-python supports speculative decoding which allows the model to generate completions based on a draft model.
The fastest way to use speculative decoding is through the LlamaPromptLookupDecoding class.
Just pass this as a draft model to the Llama class during initialization.
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10) # num_pred_tokens is the number of tokens to predict. 10 is the default and generally good for gpu, 2 performs better for cpu-only machines.
)
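To give a feel for the idea behind prompt lookup, the sketch below proposes draft tokens by finding an earlier occurrence of the most recent n-gram in the token sequence and copying what followed it. It is a simplified illustration of the general technique in pure Python, not the implementation used by LlamaPromptLookupDecoding; the function name and the max_ngram_size parameter are illustrative:

```python
from typing import List


def prompt_lookup_draft(
    tokens: List[int], max_ngram_size: int = 2, num_pred_tokens: int = 10
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence."""
    # Prefer longer n-grams, which give more specific matches.
    for n in range(min(max_ngram_size, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan earlier positions from most recent to oldest, excluding the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                # Copy the tokens that followed the earlier occurrence.
                return tokens[start + n:start + n + num_pred_tokens]
    return []  # no repetition found, so no draft is proposed


draft = prompt_lookup_draft([1, 2, 3, 4, 1, 2], num_pred_tokens=2)
print(draft)
```

The main model then verifies the proposed tokens and keeps only those it agrees with, which is why a cheap draft source like this can speed up text that repeats parts of its prompt.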
To generate text embeddings use create_embedding or embed. Note that you must pass embedding=True to the constructor upon model creation for these to work properly.
import llama_cpp

llm = llama_cpp.Llama(model_path="path/to/model.gguf", embedding=True)

embeddings = llm.create_embedding("Hello, world!")

# or create multiple embeddings at once

embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])
There are two primary notions of embeddings in a Transformer-style model: token level and sequence level. Sequence level embeddings are produced by "pooling" token level embeddings together, usually by averaging them or using the first token.
Models that are explicitly geared towards embeddings will usually return sequence level embeddings by default, one for each input string. Non-embedding models such as those designed for text generation will typically return only token level embeddings, one for each token in each sequence. Thus the dimensionality of the return type will be one higher for token level embeddings.
It is possible to control pooling behavior in some cases using the pooling_type flag on model creation. You can ensure token level embeddings from any model using LLAMA_POOLING_TYPE_NONE. The reverse, getting a generation oriented model to yield sequence level embeddings, is currently not possible, but you can always do the pooling manually.
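Manual pooling can be as simple as averaging the token vectors. The sketch below shows mean pooling in plain Python on made-up numbers; it assumes token level embeddings arrive as one list of floats per token, and mean_pool is a hypothetical helper rather than part of this package:

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Average token level embeddings into a single sequence level embedding."""
    if not token_embeddings:
        raise ValueError("cannot pool an empty sequence")
    count = len(token_embeddings)
    dimension = len(token_embeddings[0])
    return [
        sum(token[i] for token in token_embeddings) / count
        for i in range(dimension)
    ]


# Three tokens with two-dimensional embeddings (illustrative values only).
token_level = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sequence_level = mean_pool(token_level)
print(sequence_level)
```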
The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.
For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:
llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
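Since the prompt and the generated tokens share the same window, the room left for generation is the window size minus the prompt length. The arithmetic is sketched below with a hypothetical helper; it assumes no context shifting or other context management is in play:

```python
def max_new_tokens(n_ctx: int, prompt_tokens: int) -> int:
    """Tokens left for generation once the prompt occupies part of the context window."""
    return max(n_ctx - prompt_tokens, 0)


# With the default 512-token window and the 14-token prompt from the earlier sample output:
print(max_new_tokens(n_ctx=512, prompt_tokens=14))
# After raising n_ctx to 2048:
print(max_new_tokens(n_ctx=2048, prompt_tokens=14))
```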
llama-cpp-python offers a web server which aims to act as a drop-in replacement for the OpenAI API. This allows you to use llama.cpp compatible models with any OpenAI compatible client (language libraries, services, etc).
To install the server package and get started:
pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf
Similar to the hardware acceleration instructions above, you can also install with GPU (CUDA) support like this:
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
Navigate to http://localhost:8000/docs to see the OpenAPI documentation.
To bind to 0.0.0.0 to enable remote connections, use python3 -m llama_cpp.server --host 0.0.0.0. Similarly, to change the port (default is 8000), use --port.
You probably also want to set the prompt format. For chatml, use
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml
That will format the prompt according to how the model expects it. You can find the prompt format in the model card. For possible options, see llama_cpp/llama_chat_format.py and look for lines starting with "@register_chat_format".
If you have huggingface-hub installed, you can also use the --hf_model_repo_id flag to load a model from the Hugging Face Hub.
python3 -m llama_cpp.server --hf_model_repo_id Qwen/Qwen2-0.5B-Instruct-GGUF --model '*q8_0.gguf'
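Once the server is running, any HTTP client can talk to it. The sketch below builds a chat completion request with the standard library only; it assumes the server listens on http://localhost:8000 and exposes the OpenAI-style /v1/chat/completions route, and the "local-model" name is a placeholder. The request is built but not sent, so the snippet runs without a server:

```python
import json
import urllib.request


def build_chat_request(base_url: str, messages: list, model: str = "local-model") -> urllib.request.Request:
    """Build a POST request for an OpenAI-style chat completion endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (requires a running server)."""
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())


request = build_chat_request(
    "http://localhost:8000",
    [{"role": "user", "content": "Hello"}],
)
print(request.get_method(), request.full_url)
# reply = send(request)  # uncomment when the server is running
```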
A Docker image is available on GHCR. To run the server:
docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
Docker on termux (requires root) is currently the only known way to run this on phones, see termux support issue
The low-level API is a direct ctypes binding to the C API provided by llama.cpp. The entire low-level API can be found in llama_cpp/llama_cpp.py and directly mirrors the C API in llama.h.
Below is a short example demonstrating how to use the low-level API to tokenize a prompt:
import llama_cpp
import ctypes

llama_cpp.llama_backend_init(False) # Must be called once at the start of each program
params = llama_cpp.llama_context_default_params()
# use bytes for char * params
model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
ctx = llama_cpp.llama_new_context_with_model(model, params)
max_tokens = params.n_ctx
# use ctypes arrays for array params
tokens = (llama_cpp.llama_token * int(max_tokens))()
n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, llama_cpp.c_bool(True))
llama_cpp.llama_free(ctx)
Check out the examples folder for more examples of using the low-level API.
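The two ctypes conventions used in the example above (bytes for char * parameters and ctypes arrays for array parameters) can be tried without loading a model. The sketch below uses only the standard library; treating the token type as a 32-bit integer is an assumption made for illustration:

```python
import ctypes

# Assumption for illustration: a token id is a 32-bit signed integer.
token_type = ctypes.c_int32

# char * parameters take bytes, so encode Python strings first.
prompt = "Q: Name the planets in the solar system? A: ".encode("utf-8")

# Array parameters take ctypes arrays: (element_type * length)() allocates
# a zero-initialised, fixed-size buffer that C code can write into.
max_tokens = 8
tokens = (token_type * max_tokens)()

# A C function would fill the buffer; here one slot is written by hand.
tokens[0] = 1
first_three = tokens[:3]  # slicing a ctypes array yields a plain Python list

print(type(prompt).__name__, len(tokens), first_three)
```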
Documentation is available via https://llama-cpp-python.readthedocs.io/. If you find any issues with the documentation, please open an issue or submit a PR.
This package is under active development and I welcome any contributions.
To get started, clone the repository and install the package in editable / development mode:
git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

# Upgrade pip (required for editable mode)
pip install --upgrade pip

# Install with pip
pip install -e .

# if you want to use the fastapi / openapi server
pip install -e '.[server]'

# to install all optional dependencies
pip install -e '.[all]'

# to clear the local build cache
make clean
Now try running the tests
pytest
There's a Makefile available with useful targets. A typical workflow would look like this:
make build
make test
You can also test out specific commits of llama.cpp by checking out the desired commit in the vendor/llama.cpp submodule and then running make clean and pip install -e . again. Any changes in the llama.h API will require changes to the llama_cpp/llama_cpp.py file to match the new API (additional changes may be required elsewhere).
The recommended installation method is to install from source as described above. The reason for this is that llama.cpp is built with compiler optimizations that are specific to your system. Using pre-built binaries would require disabling these optimizations or supporting a large number of pre-built binaries for each platform.
That being said, there are some pre-built binaries available through the Releases as well as some community provided wheels.
In the future, I would like to provide pre-built binaries and wheels for common platforms and I'm happy to accept any useful contributions in this area. This is currently being tracked in #741
I originally wrote this package for my own use with two goals in mind:
- Provide a simple process to install llama.cpp and access the full C API in llama.h from Python
- Provide a high-level Python API that can be used as a drop-in replacement for the OpenAI API so existing apps can be easily ported to use llama.cpp
Any contributions and changes to this package will be made with these goals in mind.
This project is licensed under the terms of the MIT license.