Tensor Encoding Schemes

Brian edited this page Jul 31, 2024 · 9 revisions

There isn't really a complete writeup of all the mappings, but this page should be a good central starting point for any maintainer who needs to understand when each feature was added and the general specs of each format. Updating it with more accurate information is greatly appreciated.

Tensor Naming Scheme

This is not definitive, but it is helpful when reading source code or console output to understand what each name typically means.

  • <Encoding>_<Variants>
    • <Encoding>: This defines the most common encoding of individual weights in the model
      • Floating point formats:
        • BF16: 16-bit bfloat16, Google Brain's truncated form of 32-bit IEEE 754 (1 sign bit, 8 exponent bits, 7 fraction bits)
        • F64: 64-bit IEEE 754 floats per weight (1 sign bit, 11 exponent bits, 52 fraction bits)
        • F32: 32-bit IEEE 754 floats per weight (1 sign bit, 8 exponent bits, 23 fraction bits)
        • F16: 16-bit IEEE 754 floats per weight (1 sign bit, 5 exponent bits, 10 fraction bits)
      • Integer formats:
        • I<X>: X bits per weight, where X could be 4 (for 4 bits), 8 (for 8 bits), etc.
      • Quantized formats:
        • Q<X>: X bits per weight, where X could be 4 (for 4 bits), 8 (for 8 bits), etc.
        • KQ<X> (or Q<X>_K): k-quant based models. X bits per weight, where X could be 4 (for 4 bits), 8 (for 8 bits), etc.
        • IQ<X>: i-quant based models. X bits per weight, where X could be 4 (for 4 bits), 8 (for 8 bits), etc.
    • <Variants>: This represents different strategies of packing quantized weights into a gguf file. This exists because we may want a mix of different bit sizes for weights of varying importance, or we may be encoding a shared offset for a block or super-block. It may be omitted for trivial cases or initial attempts; refer to the encoding scheme mapping table for details.
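As a concrete illustration of the naming convention above (this helper is hypothetical, not part of llama.cpp), a scheme name can be split at its first underscore into its encoding and variant parts:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: split a scheme name like "Q4_K" or "IQ4_NL" into
// its <Encoding> and <Variants> parts at the first underscore.
// Returns 0 on success, -1 if a destination buffer is too small.
static int split_scheme_name(const char *name, char *encoding, size_t enc_cap,
                             char *variant, size_t var_cap) {
    const char *us = strchr(name, '_');
    size_t enc_len = us ? (size_t)(us - name) : strlen(name);
    if (enc_len + 1 > enc_cap) return -1;
    memcpy(encoding, name, enc_len);
    encoding[enc_len] = '\0';
    const char *var = us ? us + 1 : "";   // no underscore -> no variant
    if (strlen(var) + 1 > var_cap) return -1;
    strcpy(variant, var);
    return 0;
}
```

For example, "Q4_K" splits into encoding "Q4" and variant "K", while "BF16" has no variant; multi-part variants such as "1_F16" in "Q4_1_F16" stay in the variant string as-is.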

Tensor Encoding Scheme PR

Tensor Encoding Scheme Mapping

| Scheme | `ggml_ftype` C enumeration name | `ggml_type` C enum name | Bits/Weight | Data Type | Block Configuration | Quantized Weight Formula | Initial Commit or Pull Request Source (of `ggml_type`) |
|---|---|---|---|---|---|---|---|
| BF16 | GGML_FTYPE_MOSTLY_BF16 | GGML_TYPE_BF16 | 16 | bfloat16 (truncated 32-bit IEEE 754) | Homogeneous array of floating-point weights | - | llama.cpp PR: Introduce bfloat16 support #6412 |
| F16 | GGML_FTYPE_MOSTLY_F16 | GGML_TYPE_F16 | 16 | 16-bit IEEE 754 | Homogeneous array of floating-point weights | - | llama.cpp CM: Initial Release |
| F32 | GGML_FTYPE_ALL_F32 | GGML_TYPE_F32 | 32 | 32-bit IEEE 754 | Homogeneous array of floating-point weights | - | llama.cpp CM: Initial Release |
| F64 | - | GGML_TYPE_F64 | 64 | 64-bit IEEE 754 | Homogeneous array of floating-point weights | - | llama.cpp PR: Add support for I64 and F64 arrays #6062 |
| I8 | - | GGML_TYPE_I8 | 8 | (signed?) integer | - | - | llama.cpp PR: Designate enum vals for integer types #6050 |
| I16 | - | GGML_TYPE_I16 | 16 | (signed?) integer | - | - | llama.cpp PR: Designate enum vals for integer types #6050 |
| I32 | - | GGML_TYPE_I32 | 32 | (signed?) integer | - | - | llama.cpp PR: Designate enum vals for integer types #6050 |
| I64 | - | GGML_TYPE_I64 | 64 | (signed?) integer | - | - | llama.cpp PR: Add support for I64 and F64 arrays #6062 |
| Q4_0 | GGML_FTYPE_MOSTLY_Q4_0 | GGML_TYPE_Q4_0 | 4 | round-to-nearest quantization | Each block has 32 weights | w = q * block_scale | llama.cpp CM: Initial Release |
| Q4_1 | GGML_FTYPE_MOSTLY_Q4_1 | GGML_TYPE_Q4_1 | 4 | round-to-nearest quantization | Each block has 32 weights | w = q * block_scale + block_minimum | llama.cpp CM: Initial Release |
| Q4_1_F16 | GGML_FTYPE_MOSTLY_Q4_1_SOME_F16 | - | 4 | round-to-nearest quantization | Each block has 32 weights (token embedding and output weights are F16) | w = q * block_scale + block_minimum | llama.cpp CM: add Q5 WASM SIMD + GGML_FTYPE |
| Q8_0 | GGML_FTYPE_MOSTLY_Q8_0 | GGML_TYPE_Q8_0 | 8 | round-to-nearest quantization | Each block has 32 weights | w = q * block_scale | llama.cpp PR: Add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) #1179 |
| Q8_1 | - | GGML_TYPE_Q8_1 | 8 | round-to-nearest quantization | Each block has 32 weights | w = q * block_scale + block_minimum | llama.cpp PR: Add Q8_0 quantization for intermediate results #951 (note: renamed to Q8_1 in a later commit) |
| Q5_0 | GGML_FTYPE_MOSTLY_Q5_0 | GGML_TYPE_Q5_0 | 5 | round-to-nearest quantization | Each block has 32 weights | w = q * block_scale | llama.cpp PR: Add Q5_0 and Q5_1 quantization #1187 |
| Q5_1 | GGML_FTYPE_MOSTLY_Q5_1 | GGML_TYPE_Q5_1 | 5 | round-to-nearest quantization | Each block has 32 weights | w = q * block_scale + block_minimum | llama.cpp PR: Add Q5_0 and Q5_1 quantization #1187 |
| Q2_K | GGML_FTYPE_MOSTLY_Q2_K | GGML_TYPE_Q2_K | 2.5625 | k-quantization | Superblock has 16 blocks (16 weights per block) | w = q * block_scale (4-bit) + block_min (4-bit) | llama.cpp PR: k-quants #1684 |
| Q3_K | GGML_FTYPE_MOSTLY_Q3_K | GGML_TYPE_Q3_K | 3.4375 | k-quantization | Superblock has 16 blocks (16 weights per block) | w = q * block_scale (6-bit) | llama.cpp PR: k-quants #1684 |
| Q4_K | GGML_FTYPE_MOSTLY_Q4_K | GGML_TYPE_Q4_K | 4.5 | k-quantization | Superblock has 8 blocks (32 weights per block) | w = q * block_scale (6-bit) + block_min (6-bit) | llama.cpp PR: k-quants #1684 |
| Q5_K | GGML_FTYPE_MOSTLY_Q5_K | GGML_TYPE_Q5_K | 5.5 | k-quantization | Superblock has 8 blocks (32 weights per block) | w = q * block_scale (6-bit) + block_min (6-bit) | llama.cpp PR: k-quants #1684 |
| Q6_K | GGML_FTYPE_MOSTLY_Q6_K | GGML_TYPE_Q6_K | 6.5625 | k-quantization | Superblock has 16 blocks (16 weights per block) | w = q * block_scale (8-bit) | llama.cpp PR: k-quants #1684 |
| Q8_K | - | GGML_TYPE_Q8_K | 8.0 | k-quantization | Superblock has 1 block (256 weights per block; only used for intermediate quants) | w = q * block_scale (8-bit) | llama.cpp PR: k-quants #1684 |
| IQ1_S | GGML_FTYPE_MOSTLY_IQ1_S | GGML_TYPE_IQ1_S | 1.5 | i-quantization | Superblock has 8 blocks (32 weights per block) | w = func(superblock_scale, importance_matrix) | llama.cpp PR: 1.5 bit quantization #5453 |
| IQ1_M | GGML_FTYPE_MOSTLY_IQ1_M | GGML_TYPE_IQ1_M | 1.75 | i-quantization | Superblock has 16 blocks (16 weights per block) | w = func(superblock_scale, importance_matrix) | llama.cpp PR: IQ1_M: 1.75 bpw quantization #6302 |
| IQ2_XXS | GGML_FTYPE_MOSTLY_IQ2_XXS | GGML_TYPE_IQ2_XXS | 2.0625 | i-quantization | Superblock has 8 blocks (32 weights per block) | w = func(superblock_scale, importance_matrix) | llama.cpp PR: SOTA 2-bit quants #4773 |
| IQ2_XS | GGML_FTYPE_MOSTLY_IQ2_XS | GGML_TYPE_IQ2_XS | 2.31 | i-quantization | Superblock has 16 blocks (16 weights per block) | w = func(superblock_scale, importance_matrix) | llama.cpp PR: SOTA 2-bit quants - part 2 #4856 |
| IQ2_S | GGML_FTYPE_MOSTLY_IQ2_S | GGML_TYPE_IQ2_S | 2.5 | i-quantization | ? | w = func(superblock_scale, importance_matrix) | llama.cpp PR: Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range #5721 |
| IQ3_S | GGML_FTYPE_MOSTLY_IQ3_S | GGML_TYPE_IQ3_S | 3.4375 | i-quantization | ? | w = func(superblock_scale, importance_matrix) | llama.cpp PR: IQ3_S: a much better alternative to Q3_K #5676 |
| IQ3_XXS | GGML_FTYPE_MOSTLY_IQ3_XXS | GGML_TYPE_IQ3_XXS | 3.0625 | i-quantization | Superblock has 8 blocks (32 weights per block) | w = func(superblock_scale, importance_matrix) | llama.cpp PR: SOTA 3-bit quants #5196 |
| IQ4_NL | GGML_FTYPE_MOSTLY_IQ4_NL | GGML_TYPE_IQ4_NL | 4.5 | i-quantization | Superblock has 16 blocks (16 weights per block) | w = [non-linear mapping of quants to weights] | llama.cpp PR: IQ4_NL: 4-bit non-linear quants with blocks of 32 #5590 |
| IQ4_XS | GGML_FTYPE_MOSTLY_IQ4_XS | GGML_TYPE_IQ4_XS | 4.25 | i-quantization | Superblock has 8 blocks (32 weights per block) | w = func(superblock_scale, importance_matrix) | llama.cpp PR: IQ4_XS: a 4.25 bpw quantization #5747 |
  • All superblocks have an fp16 scaling factor and contain up to 256 weights. The number of weights in a block must be divisible by 256. (To be confirmed)
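The bits-per-weight column can be sanity-checked from the block layouts: total block bytes times 8, divided by weights per block. A minimal sketch, with block byte counts hard-coded from the ggml-common.h struct definitions (taking ggml_half as 2 bytes):

```c
#include <assert.h>

// Effective bits per weight = 8 * block_bytes / weights_per_block.
// This effective rate includes the scale/min overhead, so it can differ
// from the nominal bit width of the quants themselves.
static double bits_per_weight(int block_bytes, int weights_per_block) {
    return 8.0 * block_bytes / weights_per_block;
}
```

For example, Q4_0 stores a 2-byte fp16 scale plus 16 bytes of nibbles per 32 weights, giving 8 * 18 / 32 = 4.5 effective bits per weight; block_q2_K is 4 + 16 + 64 = 84 bytes per 256 weights, giving 2.625, matching the "Effectively 2.625 bits per weight" comment in ggml-common.h.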

Where to find the structure of these tensors in the code?

You will usually find these definitions in ggml-common.h, where they typically take this form:

Blocks

```c
#define QK4_0 32
typedef struct {
    ggml_half d;           // delta
    uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;
static_assert(sizeof(block_q4_0) == sizeof(ggml_half) + QK4_0 / 2, "wrong q4_0 block size/padding");
```
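As an example of how such a block decodes, here is a simplified sketch of Q4_0 dequantization, modeled on ggml's dequantize_row_q4_0 but using a plain float scale instead of ggml_half to stay self-contained: each nibble is an unsigned 4-bit quant with an implicit offset of 8, so w = (q - 8) * d.

```c
#include <assert.h>
#include <stdint.h>

#define QK4_0 32

// Simplified Q4_0 block: the real struct stores d as ggml_half (fp16);
// a float is used here to keep the sketch self-contained.
typedef struct {
    float   d;             // delta (block scale)
    uint8_t qs[QK4_0 / 2]; // 32 quants packed as nibbles
} sketch_block_q4_0;

// Modeled on ggml's dequantize_row_q4_0: the low nibbles hold the first
// 16 weights, the high nibbles the last 16, each with an implicit -8 offset.
static void dequantize_q4_0(const sketch_block_q4_0 *b, float *y) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int q_lo = (b->qs[j] & 0x0F) - 8;
        const int q_hi = (b->qs[j] >>   4) - 8;
        y[j]           = q_lo * b->d;
        y[j + QK4_0/2] = q_hi * b->d;
    }
}
```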

Superblocks

```c
//
// Super-block quantization structures
//

// 2-bit quantization
// weight is represented as x = a * q + b
// 16 blocks of 16 elements each
// Effectively 2.625 bits per weight
typedef struct {
    uint8_t scales[QK_K/16]; // scales and mins, quantized with 4 bits
    uint8_t qs[QK_K/4];      // quants
    union {
        struct {
            ggml_half d;    // super-block scale for quantized scales
            ggml_half dmin; // super-block scale for quantized mins
        } GGML_COMMON_AGGR;
        ggml_half2 dm;
    };
} block_q2_K;
static_assert(sizeof(block_q2_K) == 2*sizeof(ggml_half) + QK_K/16 + QK_K/4, "wrong q2_K block size/padding");
```

How are these tensors packed?

This is as explained by compilade in this thread.

Regarding how to find the bit pattern structure of a packed tensor block in the gguf file... there isn't a consistent encoding scheme for each block, as sometimes a single field in the structs stores multiple types of values, like in Q4_K where block_q4_K.scales stores 6-bit scales and mins in some pattern. The easiest way to understand what the bits mean is to have a look at the respective dequantize_row function of each type.

block_q4_K.scales packing example

The 12 bytes in Q4_K.scales are packed roughly like this, where the uppercase letters are bits of the scales and the lowercase letters are bits of the mins, corresponding to this function:

```
 0: EEAAAAAA
 1: FFBBBBBB
 2: GGCCCCCC
 3: HHDDDDDD
 4: eeaaaaaa
 5: ffbbbbbb
 6: ggcccccc
 7: hhdddddd
 8: eeeeEEEE
 9: ffffFFFF
10: ggggGGGG
11: hhhhHHHH
```

Note that this packs 6-bit scales and mins split across multiple bytes. This use of byte offsets and bitwise operations is likely done to be friendlier for SIMD processing. As compilade noted, indexing is only done at the byte level, so packing and unpacking the 6-bit values in this block requires bitwise operations. Anecdotally, while writing the vec_dot of Q1_3, compilade also noticed that shuffles in SIMD are surprisingly as fast as additions.
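That byte layout can be decoded with a helper like ggml's get_scale_min_k4. The sketch below is derived from the byte diagram above rather than copied from llama.cpp, so treat the source tree as authoritative:

```c
#include <assert.h>
#include <stdint.h>

// Recover the j-th 6-bit scale (*d) and min (*m), j in 0..7, from the
// 12 packed bytes of block_q4_K.scales, following the diagram above.
static void get_scale_min_k4(int j, const uint8_t *q, uint8_t *d, uint8_t *m) {
    if (j < 4) {
        *d = q[j] & 63;     // bytes 0-3: low 6 bits are scales A-D
        *m = q[j + 4] & 63; // bytes 4-7: low 6 bits are mins a-d
    } else {
        // bytes 8-11 hold the low 4 bits; the top 2 bits live in the
        // high 2 bits of bytes 0-3 (scales) and 4-7 (mins).
        *d = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >>  4) | ((q[j    ] >> 6) << 4);
    }
}
```

For example, with scale E = 0b110101 the low four bits land in byte 8 and the top two bits in the high bits of byte 0, and the helper reassembles them.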
