IST-DASLab/QUIK

Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference (EMNLP 2024).
This repository contains the code for QUIK, a method for quantizing the majority of the weights and activations to 4-bit post-training.
QUIK is described in the following paper: https://arxiv.org/abs/2310.09259
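For intuition only, here is a minimal sketch of the weight-splitting idea behind QUIK: a small set of outlier feature columns (the `--fp_features_num` flag in the experiments below controls how many) is kept in fp16, while the remaining columns are quantized to 4 bits. The per-row symmetric quantizer and the data layout below are illustrative assumptions and do not match the repository's actual kernels.

```python
import torch

def split_and_quantize(W: torch.Tensor, outlier_cols: torch.Tensor):
    """Toy illustration of QUIK-style weight splitting (not the repository's kernels).

    `outlier_cols` is a hypothetical index tensor of feature columns kept in fp16;
    the remaining columns get per-row symmetric 4-bit quantization.
    """
    keep = torch.zeros(W.shape[1], dtype=torch.bool)
    keep[outlier_cols] = True

    W_fp16 = W[:, keep].half()                     # outlier columns stay in full precision
    W_rest = W[:, ~keep]

    scale = W_rest.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    W_int4 = torch.clamp(torch.round(W_rest / scale), -8, 7).to(torch.int8)  # 4-bit range
    return W_fp16, W_int4, scale.half(), keep
```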
Dependencies:

- cmake
- C++ compiler (GCC/clang/...)
- nvcc (CUDA toolkit)
Installation:

```bash
git clone https://github.com/IST-DASLab/QUIK.git
cd QUIK
pip install -e .  # or pip install .
```
To run the model quantization, evaluation, and benchmarking example:

```bash
cd experiments
pip install -r requirements.txt
python llama.py --fp_features_num 256 --model meta-llama/Llama-2-7b-hf --hf_token <your_hf_token> --dataset c4 \
    --w_bits 4 --w_clip --a_bits 4 --save_qmodel_path save_gptq_model_path --int8_down_proj --sim_eval --benchmark
```
The benchmark will be run on all available GPUs.
Linear layer benchmarks can be run with `python layer_benchmark.py`; the input sizes can be varied via command-line parameters.
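If you just want a rough standalone timing of a single linear layer (outside `layer_benchmark.py`), something like the following works; the shapes, iteration counts, and use of a plain fp16 `nn.Linear` are arbitrary assumptions for illustration.

```python
import torch

def bench_linear(in_features=4096, out_features=4096, batch=16, iters=100):
    # Standalone illustrative timing loop; not experiments/layer_benchmark.py.
    layer = torch.nn.Linear(in_features, out_features, bias=False).half().cuda()
    x = torch.randn(batch, in_features, dtype=torch.float16, device="cuda")

    with torch.no_grad():
        for _ in range(10):                        # warm-up
            layer(x)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            layer(x)
        end.record()
    torch.cuda.synchronize()
    print(f"{start.elapsed_time(end) / iters:.3f} ms per forward pass")

bench_linear()
```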
To run a model with the QUIK kernels: first, one has to quantize the model weights using the GPTQ algorithm. In `llama.py` this is done with the `llama_sequential` function, which produces quantized weights (still stored in `torch.float16`). Then one needs to create QUIK Linear layers with `qlinear.MixedQLinear.from_float` and use them to replace the original Linear layers; see `llama_replace_with_kernels` in `llama.py`. After that, the quantized model is ready for use. A condensed sketch of the replacement step follows.
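The sketch below assumes a particular package/module path and particular arguments for `MixedQLinear.from_float`, which may not match the real signature; `llama_replace_with_kernels` in `experiments/llama.py` is the authoritative reference.

```python
import torch.nn as nn
from quik import qlinear  # module path assumed; match the repository's own imports

def replace_linears(model: nn.Module, quantized_weights: dict, fp_indices: dict):
    """Illustrative sketch: swap nn.Linear modules for QUIK mixed-precision layers.

    The from_float arguments used here are assumptions; follow
    llama_replace_with_kernels in experiments/llama.py for the real call.
    """
    for parent_name, parent in model.named_modules():
        for child_name, child in parent.named_children():
            full_name = f"{parent_name}.{child_name}" if parent_name else child_name
            if isinstance(child, nn.Linear) and full_name in quantized_weights:
                new_layer = qlinear.MixedQLinear.from_float(
                    child,
                    quantized_weights[full_name],  # GPTQ-quantized weights (fp16 storage)
                    fp_indices.get(full_name),     # outlier feature indices kept in fp16
                )
                setattr(parent, child_name, new_layer)
    return model
```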
To run the fake quantization example, check the `fake_quant` directory.
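Fake (simulated) quantization rounds tensors to the low-bit grid and immediately maps them back to floating point, so accuracy can be checked without the custom CUDA kernels. A minimal per-tensor example, independent of the actual `fake_quant` code, is:

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Illustrative symmetric per-tensor fake quantization; fake_quant uses its own quantizers.
    qmax = 2 ** (bits - 1) - 1                                 # 7 for 4 bits
    scale = x.abs().amax().clamp_min(1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
```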
The full paper is available on arXiv. The full citation is:
```bibtex
@article{QUIK,
  title={QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models},
  author={Ashkboos, Saleh and Markov, Ilia and Frantar, Elias and Zhong, Tingxuan and Wang, Xincheng and Ren, Jie and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2310.09259},
  year={2023}
}
```