Quantization

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).

Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also makes it possible to run models on embedded devices, which sometimes only support integer data types.

Theory

The basic idea behind quantization is quite simple: going from a high-precision representation (usually the regular 32-bit floating-point) for weights and activations to a lower-precision data type. The most common lower-precision data types are:

  • float16, accumulation data type float16
  • bfloat16, accumulation data type float32
  • int16, accumulation data type int32
  • int8, accumulation data type int32

The accumulation data type specifies the type of the result of accumulating (adding, multiplying, etc.) values of the data type in question. For example, let’s consider two int8 values A = 127, B = 127, and let’s define C as the sum of A and B:

C = A + B

Here the result is much bigger than the biggest representable value in int8, which is 127. Hence the need for a larger-precision data type to avoid a huge precision loss that would make the whole quantization process useless.
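
As a quick illustration (a minimal NumPy sketch, not part of the original text), adding two int8 values wraps around unless the result is accumulated in a wider type:

```python
import numpy as np

a = np.array([127], dtype=np.int8)
b = np.array([127], dtype=np.int8)

# Accumulating in int8 wraps around: 127 + 127 does not fit in [-128, 127].
print(a + b)                                     # [-2]

# Accumulating in a wider type (int32) gives the expected result.
print(a.astype(np.int32) + b.astype(np.int32))   # [254]
```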

Quantization

The two most common quantization cases are float32 -> float16 and float32 -> int8.

Quantization to float16

Performing quantization to go from float32 to float16 is quite straightforward since both data types follow the same representation scheme. The questions to ask yourself when quantizing an operation to float16 are:

  • Does my operation have a float16 implementation?
  • Does my hardware support float16? For instance, Intel CPUs have been supporting float16 as a storage type, but computation is done after converting to float32. Full support will come in Cooper Lake and Sapphire Rapids.
  • Is my operation sensitive to lower precision? For instance, the value of epsilon in LayerNorm is usually very small (~1e-12), but the smallest representable value in float16 is ~6e-5, which can cause NaN issues, as the short check after this list illustrates. The same applies for big values.
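
A quick NumPy check of this point (illustrative, not from the original text):

```python
import numpy as np

# The smallest positive normal float16 value is about 6.1e-5 ...
print(np.finfo(np.float16).tiny)   # ~6.104e-05

# ... so a typical LayerNorm epsilon of 1e-12 underflows to zero in float16,
# which can turn a division by (variance + eps) into a division by zero.
print(np.float16(1e-12))           # 0.0
```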

Quantization to int8

Performing quantization to go from float32 to int8 is more tricky. Only 256 values can be represented in int8, while float32 can represent a very wide range of values. The idea is to find the best way to project our range [a, b] of float32 values to the int8 space.

Let’s consider a float x in [a, b]; we can then write the following quantization scheme, also called the affine quantization scheme:

x = S * (x_q - Z)

where:

  • x_q is the quantized int8 value associated to x
  • S and Z are the quantization parameters
    • S is the scale, and is a positive float32
    • Z is called the zero-point: it is the int8 value corresponding to the value 0 in the float32 realm. It is important to be able to represent the value 0 exactly because it is used everywhere throughout machine learning models.

The quantized value x_q of x in [a, b] can be computed as follows:

x_q = round(x/S + Z)

And float32 values outside of the [a, b] range are clipped to the closest representable value, so for any floating-point number x:

x_q = clip(round(x/S + Z), round(a/S + Z), round(b/S + Z))

Usually round(a/S + Z) corresponds to the smallest representable value in the considered data type, and round(b/S + Z) to the biggest one. But this can vary, for instance when using a symmetric quantization scheme, as you will see in the next paragraph.
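
Putting the formulas above together, here is a minimal NumPy sketch of the affine scheme (illustrative only; production implementations handle more corner cases):

```python
import numpy as np

def affine_quantize(x, a, b):
    """Quantize float32 values to int8 with the affine scheme, mapping [a, b] to [-128, 127]."""
    qmin, qmax = -128, 127
    S = (b - a) / (qmax - qmin)        # scale: float step per integer step
    Z = round(qmin - a / S)            # zero-point: the int8 value that maps to 0.0
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)
    return x_q, S, Z

def affine_dequantize(x_q, S, Z):
    # x = S * (x_q - Z), recovering an approximation of the original float values.
    return S * (x_q.astype(np.float32) - Z)

x = np.array([-0.5, 0.0, 0.37, 2.0], dtype=np.float32)
x_q, S, Z = affine_quantize(x, a=-1.0, b=2.0)
print(x_q, affine_dequantize(x_q, S, Z))
```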

Symmetric and affine quantization schemes

The equation above is called the affine quantization scheme because the mapping from [a, b] to int8 is an affine one.

A common special case of this scheme is the symmetric quantization scheme, where we consider a symmetric range of float values [-a, a]. In this case the integer space is usually [-127, 127], meaning that -128 is opted out of the regular [-128, 127] signed int8 range. The reason is that having a symmetric range allows Z = 0. While one of the 256 representable values is lost, it can provide a speedup since a lot of addition operations can be skipped.
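
A minimal NumPy sketch of the symmetric scheme (illustrative, assuming the float range is taken as [-a, a] with a = max |x|):

```python
import numpy as np

def symmetric_quantize(x):
    """Symmetric int8 quantization: range [-a, a], Z = 0, integers in [-127, 127]."""
    a = np.abs(x).max()
    S = a / 127.0                       # the zero-point is implicitly 0
    x_q = np.clip(np.round(x / S), -127, 127).astype(np.int8)
    return x_q, S

x = np.array([-0.8, 0.1, 0.6], dtype=np.float32)
x_q, S = symmetric_quantize(x)
print(x_q, S * x_q.astype(np.float32))  # dequantization is simply S * x_q
```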

Note: To learn how the quantization parameters S and Z are computed, you can read the Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference paper, or Lei Mao’s blog post on the subject.

Per-tensor and per-channel quantization

Depending on the accuracy / latency trade-off you are targeting, you can play with the granularity of the quantization parameters:

  • Quantization parameters can be computed on a per-tensor basis, meaning that one pair of (S, Z) will be used per tensor.
  • Quantization parameters can be computed on a per-channel basis, meaning that it is possible to store a pair of (S, Z) per element along one of the dimensions of a tensor. For example, for a tensor of shape [N, C, H, W], having per-channel quantization parameters for the second dimension would result in C pairs of (S, Z) (see the sketch after this list). While this can give better accuracy, it requires more memory.
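
A minimal NumPy sketch of the difference in granularity (illustrative only; it uses a symmetric scale and ignores zero-points):

```python
import numpy as np

# Hypothetical weight tensor of shape [N, C, H, W].
w = np.random.randn(8, 4, 3, 3).astype(np.float32)

# Per-tensor: a single scale for the whole tensor.
scale_per_tensor = np.abs(w).max() / 127.0

# Per-channel along the second dimension (C): one scale per channel,
# i.e. C quantization parameters instead of one.
scale_per_channel = np.abs(w).max(axis=(0, 2, 3)) / 127.0   # shape (4,)

print(scale_per_tensor, scale_per_channel)
```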

Calibration

The section above described how quantization from float32 to int8 works, but one question remains: how is the [a, b] range of float32 values determined? That is where calibration comes into play.

Calibration is the step during quantization where the float32 ranges are computed. For weights it is quite easy since the actual range is known at quantization-time. But it is less clear for activations, and different approaches exist:

  1. Post-training dynamic quantization: the range for each activation is computed on the fly at runtime. While this gives great results without too much work, it can be a bit slower than static quantization because of the overhead introduced by computing the range each time. It is also not an option on certain hardware; a minimal example is sketched after this list.
  2. Post-training static quantization: the range for each activation is computed in advance at quantization-time, typically by passing representative data through the model and recording the activation values. In practice, the steps are:
    1. Observers are put on activations to record their values.
    2. A certain number of forward passes on a calibration dataset is done (around 200 examples is enough).
    3. The ranges for each computation are computed according to some calibration technique.
  3. Quantization-aware training: the range for each activation is computed at training-time, following the same idea as post-training static quantization. But “fake quantize” operators are used instead of observers: they record values just as observers do, but they also simulate the error induced by quantization to let the model adapt to it.
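
To make the first option concrete, here is a minimal sketch of post-training dynamic quantization using PyTorch’s eager-mode API (the toy model and its sizes are illustrative, not part of the original text):

```python
import torch

# A toy float32 model; in practice this would be a 🤗 Transformers model.
model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Dynamic quantization: weights are quantized to int8 ahead of time,
# activation ranges are computed on the fly at runtime.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

print(model_int8(torch.randn(1, 128)).shape)
```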

For both post-training static quantization and quantization-aware training, it is necessary to define calibration techniques. The most common are listed below; the first two are sketched in code after the list:

  • Min-max: the computed range is [min observed value, max observed value]; this works well with weights.
  • Moving average min-max: the computed range is [moving average min observed value, moving average max observed value]; this works well with activations.
  • Histogram: records a histogram of values along with min and max values, then chooses according to some criterion:
    • Entropy: the range is computed as the one minimizing the error between the full-precision and the quantized data.
    • Mean Square Error: the range is computed as the one minimizing the mean square error between the full-precision and the quantized data.
    • Percentile: the range is computed using a given percentile value p on the observed values. The idea is to try to have p% of the observed values in the computed range. While this is possible when doing affine quantization, it is not always possible to exactly match that when doing symmetric quantization. You can check how it is done in ONNX Runtime for more details.
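
Below is a minimal NumPy sketch (illustrative only) of the first two calibration techniques, computing a range from recorded activation values:

```python
import numpy as np

def minmax_range(batches):
    # Min-max calibration: the range is simply the observed min and max.
    lo = min(b.min() for b in batches)
    hi = max(b.max() for b in batches)
    return lo, hi

def moving_average_minmax_range(batches, momentum=0.9):
    # Moving average min-max: smooth the per-batch min/max over the calibration set.
    lo, hi = batches[0].min(), batches[0].max()
    for b in batches[1:]:
        lo = momentum * lo + (1 - momentum) * b.min()
        hi = momentum * hi + (1 - momentum) * b.max()
    return lo, hi

# Fake "activations" recorded over a few calibration forward passes.
batches = [np.random.randn(32, 128).astype(np.float32) for _ in range(10)]
print(minmax_range(batches))
print(moving_average_minmax_range(batches))
```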

Practical steps to follow to quantize a model to int8

To effectively quantize a model to int8, the steps to follow are:

  1. Choose which operators to quantize. Good operators to quantize are the ones dominating in terms of computation time, for instance linear projections and matrix multiplications.
  2. Try post-training dynamic quantization; if it is fast enough, stop here, otherwise continue to step 3.
  3. Try post-training static quantization, which can be faster than dynamic quantization but often comes with a drop in accuracy. Apply observers to your models in the places where you want to quantize.
  4. Choose a calibration technique and perform it.
  5. Convert the model to its quantized form: the observers are removed and the float32 operators are converted to their int8 counterparts.
  6. Evaluate the quantized model: is the accuracy good enough? If yes, stop here; otherwise, start again at step 3 but with quantization-aware training this time.

Supported tools to perform quantization in 🤗 Optimum

🤗 Optimum provides APIs to perform quantization using different tools for different targets:

  • The optimum.onnxruntime package makes it possible to quantize and run ONNX models using the ONNX Runtime tool (a short example follows this list).
  • The optimum.intel package enables quantizing 🤗 Transformers models while respecting accuracy and latency constraints.
  • The optimum.fx package provides wrappers around the PyTorch quantization functions to allow graph-mode quantization of 🤗 Transformers models in PyTorch. This is a lower-level API compared to the two mentioned above, giving more flexibility, but requiring more work on your end.
  • The optimum.gptq package makes it possible to quantize and run LLMs with GPTQ.
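
As an example of the first option, here is a sketch of post-training dynamic quantization of an ONNX model with optimum.onnxruntime; the model name and quantization configuration are illustrative, so check the optimum.onnxruntime documentation for the exact options:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export a 🤗 Transformers model to ONNX (the model name is illustrative).
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True
)

# Apply post-training dynamic quantization with ONNX Runtime.
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="distilbert_quantized", quantization_config=qconfig)
```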

Going further: How do machines represent numbers?

This section is not fundamental to understanding the rest. It explains briefly how numbers are represented in computers. Since quantization is about going from one representation to another, it can be useful to have some basics, but it is definitely not mandatory.

The most fundamental unit of representation for computers is the bit. Everything in computers is represented as a sequence of bits, including numbers. But the representation varies depending on whether the numbers in question are integers or real numbers.

Integer representation

Integers are usually represented with the following bit lengths: 8, 16, 32, 64. When representing integers, two cases are considered:

  1. Unsigned (positive) integers: they are simply represented as a sequence of bits. Each bit corresponds to a power of two (from 0 to n-1 where n is the bit length), and the resulting number is the sum of those powers of two.

Example: 19 is represented as an unsigned int8 as 00010011 because:

19 = 0 x 2^7 + 0 x 2^6 + 0 x 2^5 + 1 x 2^4 + 0 x 2^3 + 0 x 2^2 + 1 x 2^1 + 1 x 2^0
  2. Signed integers: it is less straightforward to represent signed integers, and multiple approaches exist, the most common being the two’s complement (briefly illustrated below). For more information, you can check the Wikipedia page on the subject.
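
As a short illustration (plain Python, not from the original text), the unsigned decomposition of 19 above and the 8-bit two’s complement of -19 look like this:

```python
# Unsigned 8-bit representation of 19, matching the decomposition above.
print(format(19, "08b"))           # 00010011

# 8-bit two's complement of -19: invert the bits of 19 and add 1
# (equivalently, take -19 modulo 256).
print(format(-19 & 0xFF, "08b"))   # 11101101
```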

Real numbers representation

Real numbers are usually represented with the following bit lengths: 16, 32, 64. The two main ways of representing real numbers are:

  1. Fixed-point: there is a fixed number of digits reserved for representing the integer part and the fractional part.
  2. Floating-point: the number of digits for representing the integer and the fractional parts can vary.

The floating-point representation can represent bigger ranges of values, and this is the one we will be focusing on since it is the most commonly used. There are three components in the floating-point representation:

  1. The sign bit: this is the bit specifying the sign of the number.
  2. The exponent part:
    • 5 bits in float16
    • 8 bits in bfloat16
    • 8 bits in float32
    • 11 bits in float64
  3. The mantissa:
    • 11 bits in float16 (10 explicitly stored)
    • 8 bits in bfloat16 (7 explicitly stored)
    • 24 bits in float32 (23 explicitly stored)
    • 53 bits in float64 (52 explicitly stored)

For more information on the bit allocation for each data type, check the nice illustration on the Wikipedia page about the bfloat16 floating-point format.

For a real number x we have:

x = sign * mantissa * 2^exponent
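
As a quick illustration (a NumPy sketch, not from the original text), the bit fields of a float16 value can be inspected directly:

```python
import numpy as np

# Inspect the bit fields of a float16 value:
# 1 sign bit, 5 exponent bits, 10 explicitly stored mantissa bits.
bits = int(np.array([-1.5], dtype=np.float16).view(np.uint16)[0])

sign     = bits >> 15            # 1
exponent = (bits >> 10) & 0x1F   # 15 (biased by 15, so the actual exponent is 0)
mantissa = bits & 0x3FF          # 512, i.e. the stored fraction .1000000000

print(sign, exponent, mantissa)  # -1.5 = (-1)^1 * (1 + 512/1024) * 2^(15 - 15)
```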
