IBM GGUF-encoded AI models and conversion scripts
This repository provides an automated CI/CD process to convert, test and deploy IBM Granite models, in safetensors format, from the `ibm-granite` organization to versioned IBM GGUF collections on the Hugging Face Hub under the `ibm-research` organization. This includes:
- Target IBM models for format conversion
- GGUF Conversion & Quantization
- GGUF Verification Testing
- References
- Releasing GGUF model conversions & quantizations
Format conversions (i.e., GGUF) and quantizations will only be provided for model repositories canonically hosted in an official IBM Hugging Face organization. Currently, this includes the following organizations:
- `ibm-granite`
- `ibm-research`
Additionally, only a select set of IBM models from these orgs. will be converted based upon the following general criteria:
- The IBM GGUF model needs to be referenced by an AI provider service as a "supported" model.
- The GGUF model is referenced by a public blog, tutorial, demo, or other public use case.
  - Specifically, if the model is referenced in the IBM Granite Snack Cookbook.
Select quantizations will only be made available when:
- A small form factor is justified:
  - e.g., reduced model size intended for running locally on small form-factor devices such as watches and mobile devices.
- Performance provides significant benefit without compromising accuracy (or enabling hallucination).
Specifically, the following Granite model repositories are currently supported in GGUF format (by collection), with the listed quantizations:
Typically, this model category includes "instruct" models.
| Source Repo. ID | HF (llama.cpp) Architecture | Target HF Org. |
|---|---|---|
| ibm-granite/granite-3.2-2b-instruct | GraniteForCausalLM (gpt2) | ibm-research |
| ibm-granite/granite-3.2-8b-instruct | GraniteForCausalLM (gpt2) | ibm-research |
- Supported quantizations: `fp16`, `Q2_K`, `Q3_K_L`, `Q3_K_M`, `Q3_K_S`, `Q4_0`, `Q4_1`, `Q4_K_M`, `Q4_K_S`, `Q5_0`, `Q5_1`, `Q5_K_M`, `Q5_K_S`, `Q6_K`, `Q8_0`
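For local experimentation, one of these instruct conversions can be pulled and run directly from the Hugging Face Hub via Ollama. The repository name and quantization tag below are only illustrative; check the `ibm-research` GGUF collection for the exact repo. ID and available tags:

```bash
# Illustrative only: take the exact GGUF repo. name and tag from the ibm-research collection on Hugging Face
ollama run hf.co/ibm-research/granite-3.2-8b-instruct-GGUF:Q4_K_M
```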
| Source Repo. ID | HF (llama.cpp) Architecture | Target HF Org. |
|---|---|---|
| ibm-granite/granite-guardian-3.2-3b-a800m | GraniteMoeForCausalLM (granitemoe) | ibm-research |
| ibm-granite/granite-guardian-3.2-5b | GraniteMoeForCausalLM (granitemoe) | ibm-research |
- Supported quantizations: `fp16`, `Q4_K_M`, `Q5_K_M`, `Q6_K`, `Q8_0`
| Source Repo. ID | HF (llama.cpp) Architecture | Target HF Org. |
|---|---|---|
| ibm-granite/granite-vision-3.2-2b | GraniteForCausalLM (granite), LlavaNextForConditionalGeneration | ibm-research |
- Supported quantizations: `fp16`, `Q4_K_M`, `Q5_K_M`, `Q8_0`
| Source Repo. ID | HF (llama.cpp) Architecture | Target HF Org. |
|---|---|---|
| ibm-granite/granite-embedding-30m-english | Roberta (roberta-bpe) | ibm-research |
| ibm-granite/granite-embedding-125m-english | Roberta (roberta-bpe) | ibm-research |
| ibm-granite/granite-embedding-107m-multilingual | Roberta (roberta-bpe) | ibm-research |
| ibm-granite/granite-embedding-278m-multilingual | Roberta (roberta-bpe) | ibm-research |
- Supported quantizations: `fp16`, `Q8_0`
Note: Sparse model architecture (i.e., HF `RobertaMaskedLM`) is not currently supported; therefore, there is no conversion for `ibm-granite/granite-embedding-30m-sparse`.
- LoRA support is currently planned (no target date).
The GGUF format is defined in the GGUF specification. The specification describes the structure of the file, how it is encoded, and what information is included.
Currently, the primary means to convert from HF SafeTensors format to GGUF is the canonical llama.cpp tool `convert-hf-to-gguf.py`.

For example:

```bash
python llama.cpp/convert-hf-to-gguf.py ./<model_repo> --outfile output_file.gguf --outtype q8_0
```
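Lower-bit quantizations are typically produced in a second step with the `llama-quantize` tool from the same llama.cpp build: first convert the safetensors checkpoint to an `fp16` GGUF, then quantize that file to the desired type. A minimal sketch (file names are illustrative):

```bash
# 1. Convert the HF safetensors repo. to an fp16 GGUF (file names are illustrative)
python llama.cpp/convert-hf-to-gguf.py ./<model_repo> --outfile model-f16.gguf --outtype f16

# 2. Quantize the fp16 GGUF to a smaller type, e.g. Q4_K_M
./llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```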
https://github.com/ollama/ollama/blob/main/docs/import.md#quantizing-a-model
```
$ ollama create --quantize q4_K_M mymodel
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:735e246cc1abfd06e9cdcf95504d6789a6cd1ad7577108a70d9902fef503c1bd
creating new layer sha256:0853f0ad24e5865173bbf9ffcc7b0f5d56b66fd690ab1009867e45e7d2c4db0f
writing manifest
success
```
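For reference, `ollama create` reads a Modelfile whose `FROM` line points at the source weights. A minimal sketch for quantizing a locally converted GGUF file (the file and model names are illustrative):

```bash
# Minimal Modelfile pointing at a locally converted fp16 GGUF (names are illustrative)
cat > Modelfile <<'EOF'
FROM ./granite-3.2-8b-instruct-f16.gguf
EOF

# Create a quantized Ollama model from it
ollama create --quantize q4_K_M my-granite -f Modelfile
```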
Note: The Ollama CLI tool only supports a subset of quantizations:
- (rounding): `q4_0`, `q4_1`, `q5_0`, `q5_1`, `q8_0`
- k-means: `q3_K_S`, `q3_K_M`, `q3_K_L`, `q4_K_S`, `q4_K_M`, `q5_K_S`, `q5_K_M`, `q6_K`
Note: Similar to the Ollama CLI, the web UI supports only a subset of quantizations.
As a baseline, each converted model MUST run successfully in the following providers:

- llama.cpp - As the core implementation of the GGUF format, which is either a direct dependency or used as forked code in nearly all downstream GGUF providers, testing is essential. Specifically, testing verifies that the model can be hosted using the `llama-server` service. See the specific section on `llama.cpp` for more details on which version is considered "stable" and how the same version is used in both conversion and testing.
- Ollama - As a key model service provider supported by higher-level frameworks and platforms (e.g., AnythingLLM, LM Studio, etc.), testing the ability to `pull` and `run` the model is essential.
Notes:
- The official Ollama Docker image ollama/ollama is available on Docker Hub.
- Ollama does not yet support sharded GGUF models:
  - "Ollama does not support this yet. Follow this issue for more info: ollama/ollama#5245"
- e.g., `ollama pull hf.co/Qwen/Qwen2.5-14B-Instruct-GGUF`
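As an illustration of the llama.cpp baseline check, a converted model can be hosted with `llama-server` and exercised through its OpenAI-compatible HTTP endpoint. A minimal smoke-test sketch (the model path and prompt are only examples):

```bash
# Host the converted model locally (model path is illustrative)
./bin/llama-server -m granite-3.2-8b-instruct-Q4_K_M.gguf --port 8080 &

# Issue a minimal chat completion request against the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
```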
GGUF format
- Hugging Face: GGUF - describes the format and some of the header structure.
- llama.cpp:
  - GGUF Quantization types (`ggml_ftype`) - `ggml/include/ggml.h`
  - GGUF Quantization types (`LlamaFileType`) - `gguf-py/gguf/constants.py`
GGUF Examples
GGUF tools
- GGUF-my-repo - Hugging Face space to build your own quants. without any setup. (Referenced by llama.cpp example docs.)
- CISCai/gguf-editor - batch conversion tool for GGUF models in HF model repos.
llama.cpp Tutorials
- How to convert any HuggingFace Model to gguf file format? - using the `llama.cpp/convert-hf-to-gguf.py` conversion script.
Ollama tutorials
- Importing a model - includes Safetensors, GGUF.
- Use Ollama with any GGUF Model on Hugging Face Hub
- Using Ollama models from Langchain - This example uses the `gemma2` model supported by Ollama.
This repository uses GitHub workflows and actions to convert IBM Granite models hosted on Hugging Face to GGUF format, quantize them, run build-verification tests on the resultant models and publish them to target GGUF collections in IBM-owned Hugging Face organizations (e.g., `ibm-research` and `ibm-granite`).
There are 3 types of releases that can be performed on this repository:
- Test (private) - releases GGUF models to a test (or private) repo. on Hugging Face.
- Preview (private) - releases GGUF models to a GGUF collection within the `ibm-granite` HF organization for time-limited access to select IBM partners (typically for pre-release testing and integration).
- Public - releases GGUF models to a public GGUF collection within the `ibm-research` HF organization for general use.
Note: The Hugging Face (HF) term "private" means that repos. and collections created in the target HF organization are visible only to organization contributors and hidden from normal users.
Prior to "triggering" release workflows, some files need to be configured depending on the release type.
Project maintainers for this repo. are able to access the secrets (tokens) that are made available to the CI/CD release workflows/actions:
https://github.com/IBM/gguf/settings/secrets/actions
Secrets are used to authenticate with GitHub and Hugging Face (HF) and are already configured for the `ibm-granite` and `ibm-research` HF organizations for "preview" and "public" release types.
For "test" (or private) builds, users can fork the repo. and add a repository secret namedHF_TOKEN_TEST
with a token (value) created on their test (personal, private) HF organization account with appropriate privileges to allow write access to repos. and collections.
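If you prefer the GitHub CLI to the web UI, the same secret can (in principle) be added to your fork like this; it assumes an authenticated `gh` session against the fork, and the repo. path is illustrative:

```bash
# Add the HF token as a repository secret on your fork (gh prompts for the secret value)
gh secret set HF_TOKEN_TEST --repo <your-user>/gguf
```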
If you need to encode information for project CI GitHub workflows, please use the following macOS command and ensure there are no line breaks:

```bash
base64 -i <input_file> > <output_file>
```
Each release uses a model collection mapping file that defines which model repositories (along with their titles, descriptions and family designations) belong to that collection. Family designations allow granular control over which model families are included in a release, which allows for "staggered" releases, typically by model architecture (e.g., `vision`, `embedding`, etc.).
Originally, different IBM Granite releases had their own collection mapping file; however, we now use a single collection mapping file for all releases of GGUF model formats for simpler downstream consumption:
- Unified mapping (all release types): `resources/json/latest/hf_collection_mapping_gguf.json`
The JSON collection mapping files have the following structure using the "Public" release as an example:
{"collections": [ {"title":"Granite GGUF Models","description":"GGUF-formatted versions of IBM Granite models. Licensed under the Apache 2.0 license.","items": [ {"type":"model","family":"instruct","repo_name":"granite-3.3-8b-instruct" },... {"type":"model","family":"vision","repo_name":"granite-vision-3.2-2b" },... {"type":"model","family":"guardian","repo_name":"granite-guardian-3.2-3b-a800m" },... {"type":"model","family":"embedding","repo_name":"granite-embedding-30m-english" },... ] } ]}
Simply add a new object under the `items` array for each new IBM Granite repo. you want added to the corresponding (GGUF) collection.

Currently, the only HF item type supported is `model`, and valid families (which have supported workflows) include: `instruct` (language), `vision`, `guardian` and `embedding`.
Note: If you need to change the HF collection description, please know that HF limits this string to 150 chars. or less.
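A quick way to sanity-check the mapping file before a release is to list the families and repo. names it contains, for example with `jq` (assumes `jq` is installed locally):

```bash
# List each (family, repo_name) pair defined in the unified collection mapping
jq -r '.collections[].items[] | "\(.family)\t\(.repo_name)"' \
  resources/json/latest/hf_collection_mapping_gguf.json
```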
Each release type has a corresponding (parent, master) workflow that configures and controls which model families (i.e., `instruct` (language), `vision`, `guardian` and `embedding`) are executed for a given GitHub (tagged) release.
For example, a `3.2` versioned release uses the following files, which correspond to one of the release types (i.e., `Test`, `Preview` or `Public`):
- Test: `.github/workflows/granite-3.2-release-test.yml`
- Preview: `.github/workflows/granite-3.2-release-preview-ibm-granite.yml`
- Public: `.github/workflows/granite-3.2-release-ibm-research.yml`
The YAML GitHub workflow files have a few environment variables that may need to be updated to reflect which collections, models and quantizations should be included in the next GitHub (tagged) release. Using the "Public" release YAML file as an example:
```yaml
env:
  ENABLE_INSTRUCT_JOBS: false
  ENABLE_VISION_JOBS: false
  ENABLE_GUARDIAN_JOBS: true
  SOURCE_INSTRUCT_REPOS: "[ 'ibm-granite/granite-3.2-2b-instruct', ... ]"
  TARGET_INSTRUCT_QUANTIZATIONS: "[ 'Q4_K_M', ... ]"
  SOURCE_GUARDIAN_REPOS: "[ 'ibm-granite/granite-guardian-3.2-3b-a800m', ... ]"
  TARGET_GUARDIAN_QUANTIZATIONS: "[ 'Q4_K_M', ... ]"
  SOURCE_VISION_REPOS: "[ 'ibm-granite/granite-vision-3.2-2b', ... ]"
  TARGET_VISION_QUANTIZATIONS: "[ 'Q4_K_M', ... ]"
  ...
  COLLECTION_CONFIG: "resources/json/latest/hf_collection_mapping_gguf.json"
```
Note that the `COLLECTION_CONFIG` env. var. provides the relative path to the collection configuration file, which is located in the `resources/json` directory of the repository for the specific Granite release.
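Before tagging a release, it can help to double-check which job groups are switched on in the chosen workflow file, for example:

```bash
# Show the ENABLE_*_JOBS switches in the public release workflow
grep -E 'ENABLE_[A-Z]+_JOBS' .github/workflows/granite-3.2-release-ibm-research.yml
```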
Clone and build the following llama.cpp binaries using these build/link flags:
The following command will create the proper CMake `build` files for generating code that will run within both `macos` and `ubuntu` container images. These flags also ensure that the llama.cpp libraries will not attempt to use GPUs, since the current GitHub virtual machines for both operating systems do not support this.
```bash
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_METAL=OFF -DGGML_NATIVE_DEFAULT=OFF -DCMAKE_CROSSCOMPILING=TRUE -DGGML_NO_ACCELERATE=ON
```
Note: As flags have changed often, the following minimal set of flags MAY work but needs testing:
```bash
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_NO_ACCELERATE=ON -DCMAKE_CROSSCOMPILING=TRUE
```
Use this command to build all llama.cpp tool binaries into the `build/bin` directory:
```bash
cmake --build build --config Release
```
Once built locally, copy the following files from your `build/bin` directory to this repository's `bin` directory:
- llama-cli
- llama-quantize
- llama-run
- llama-server
- llama-llava-cli (may no longer be needed/supported as of May 2025, as llava support has been rolled into the general libraries under multimodal support, aka `mtmd`)
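Put together, the local build-and-copy steps look roughly like the following. This is a sketch that assumes a fresh llama.cpp clone next to this repository; pin llama.cpp to whichever tag/commit is considered "stable" for the release:

```bash
# Clone llama.cpp (pin to the "stable" tag/commit used for the release)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# Configure without GPU/accelerator support so binaries run on GitHub-hosted runners
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_METAL=OFF -DGGML_NATIVE_DEFAULT=OFF \
      -DCMAKE_CROSSCOMPILING=TRUE -DGGML_NO_ACCELERATE=ON

# Build all tool binaries into build/bin
cmake --build build --config Release

# Copy the binaries this repository expects into its bin/ directory
cd ..
cp llama.cpp/build/bin/llama-cli llama.cpp/build/bin/llama-quantize \
   llama.cpp/build/bin/llama-run llama.cpp/build/bin/llama-server ./bin/
```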
This section contains the steps required to successfully "trigger" a release workflow for one or more supported Granite model families (i.e., `instruct` (language), `vision`, `guardian` and `embedding`).
1. Click the Releases link from the right column of the repo. home page, which should be the URL https://github.com/IBM/gguf/releases.
2. Click the "Draft a new release" button near the top of the releases page.
3. Click the "Choose a tag" drop-down menu and enter a tag name that starts with one of the following strings, relative to which release type you want to "trigger":
   - Test: `test-v3.3` (private HF org.)
   - Preview: `preview-v3.3` (IBM Granite, private/hidden)
   - Public: `v3.3` (IBM Research, public)

   Treat these strings as "prefixes" to which you must append a unique build version, for example: `v3.3-rc-01` for release candidate version 01 under the `ibm-research` org. on the Hugging Face Hub.
4. Click "Create a new tag: on publish" near the bottom of the drop-down list.
5. By convention, add the same "tag" name you created in the previous step into the "Release title" entry field.
6. Adjust the "Set as a pre-release" and "Set as the latest release" checkboxes to your desired settings.
7. Click the "Publish release" button.
At this point, you can observe the CI/CD workflows being run by the GitHub service "runners". Please note that during heavy traffic times, assignment of a "runner" (for each workflow job) may take longer.
To observe the CI/CD process in action, please navigate to the repository's Actions page and look for the name of the `tag` you entered for the release (above) in the workflow run title.
Note: It is common to occasionally see some jobs "fail" due to network or scheduling timeout errors. In these cases, you can go into the failed workflow run and click the "Re-run failed jobs" button to re-trigger the failed job(s).
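Failed jobs can also be re-triggered from the command line with the GitHub CLI, given the numeric ID of the failed workflow run:

```bash
# Re-run only the failed jobs of a given workflow run (run ID is illustrative)
gh run rerun 1234567890 --failed
```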