Description
This PR aims to enhance the integration of Mistral models with llama.cpp by addressing several key issues and introducing new features. Here are the details:
Context
- The current HF conversion to GGUF does not work directly for Mistral models because our weight format is vLLM-based. This means weights must first be converted to Hugging Face format and then to GGUF, which is not ideal and can lead to conversion errors if the first conversion is not done correctly. It also means that adding new models to the llama.cpp ecosystem requires first adding them to Transformers.
- We do not support chat templates natively, which means chat templates are community-maintained and not guaranteed to work correctly.
- We are using mistral-common internally for tokenization and want the community to use it to unlock the full capabilities of our models. As mistral-common is a Python library, we have opened a PR to add a REST API via FastAPI, to make it easier for users who are not in the Python ecosystem.
Using mistral-common with llama.cpp
We recommend that users only use the `llama-server` tool with the `/completions` route of the server for now, as it is the only one that supports token input. We also advise users to set `return_tokens=True` in their requests to let mistral-common handle detokenization.
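For reference, a raw request to the llama.cpp `/completions` route might look like the sketch below. The token IDs here are placeholders, not real Devstral tokens; in practice they come from the mistral-common server, as shown in the Example Code section.

```sh
# Placeholder token IDs: obtain real ones from mistral-common's /tokenize/messages route.
curl http://127.0.0.1:8080/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": [1, 3, 1010, 4], "stream": false, "return_tokens": true}'
```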
Added features
- Model conversion: We have added a script, `convert_mistral_to_gguf.py`, to convert Mistral models to GGUF format directly from Hugging Face (see the example command in the Example Code section below).
- Model architecture: We registered the Mistral architecture in llama.cpp to support Mistral models natively. This allows users to use Mistral models with llama.cpp without having to convert them to Hugging Face format first.
Known Limitations:
Our approach does not support multimodality:
- mistral-common handles processing of multimodal data, but it cannot be passed to llama.cpp via the `/completions` route.
- llama.cpp only supports multimodality via chat templates, which we do not support.
Also, this approach requires users to only use the llama.cpp server with the `/completions` route.
Example Code
To get started, install mistral-common using the following command:

```sh
pip install git+https://github.com/mistralai/mistral-common.git@improve_llama_cpp_integration[server]
```

(Optional) Convert the model

```sh
HF_TOKEN=... python convert_mistral_to_gguf.py \
    mistralai/Devstral-Small-2505 --remote --ctx-train 131072 --outtype bf16
```

Launch the mistral-common and llama.cpp servers

Launch the mistral-common server:

```sh
HF_TOKEN=... mistral_common mistralai/Devstral-Small-2505 --port 6000
```

Launch the llama.cpp server:

```sh
./build/bin/llama-server -m models/Devstral-Small-2505-Q4_K_M.gguf --port 8080
```
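Optionally, you can check that the llama.cpp server has finished loading the model before sending requests, for example via its `/health` route (a quick sanity check, assuming the default port used above):

```sh
# Reports an OK status once the model is loaded and the server is ready.
curl http://127.0.0.1:8080/health
```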
Use the servers
Here is a code snippet demonstrating how to use the new features:
```python
import requests

mistral_common_url = "http://127.0.0.1:6000"
llama_cpp_url = "http://127.0.0.1:8080"


# Helpers wrapping the mistral-common and llama.cpp HTTP routes.
def tokenize(messages, url):
    response = requests.post(f"{url}/tokenize/messages", json=messages)
    return response.json()


def detokenize(tokens, url):
    response = requests.post(f"{url}/detokenize", json={"tokens": tokens})
    return response.json()


def detokenize_message(tokens, url):
    response = requests.post(f"{url}/detokenize", json={"tokens": tokens, "as_message": True})
    return response.json()


def generate(tokens, url):
    response = requests.post(
        f"{url}/completions",
        json={"prompt": tokens, "stream": False, "return_tokens": True},
    )
    return response.json()


messages = [
    {
        "role": "system",
        "content": "You are Devstral a cool coding agent that can help users with their coding needs.",
    },
    {"role": "user", "content": "Who are you and what can you do?"},
]

# Tokenize with mistral-common, generate with llama.cpp, then detokenize with mistral-common.
tokens = tokenize(messages, mistral_common_url)
print(tokens)

generated = generate(tokens, llama_cpp_url)["tokens"]
print(generated)

detokenized = detokenize(generated, mistral_common_url)
print(detokenized)

detokenized_message = detokenize_message(generated, mistral_common_url)
print(detokenized_message)
```
Feedback and Contributions
We believe these changes will significantly improve the integration of Mistral models with llama.cpp and provide a better experience for our users. We welcome any feedback or suggestions to further enhance this integration. Also, as we have little experience with the llama.cpp codebase, we welcome any help to improve the integration and make sure we respect the codebase and the community.