QUIK

Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference (EMNLP 2024).

This repository contains the code for QUIK, a method for quantizing the majority of the weights and activations to 4-bit post-training.

QUIK is described in the following paper: https://arxiv.org/abs/2310.09259

Install

Dependencies

  • cmake
  • C++ compiler (GCC/clang/...)
  • nvcc

Instructions

git clone https://github.com/IST-DASLab/QUIK.git
cd QUIK
pip install -e .  # or: pip install .

Example

Llama example

cd experiments
pip install -r requirements.txt
python llama.py --fp_features_num 256 --model meta-llama/Llama-2-7b-hf --hf_token <your_hf_token> --dataset c4 \
  --w_bits 4 --w_clip --a_bits 4 --save_qmodel_path save_gptq_model_path --int8_down_proj --sim_eval --benchmark

The benchmark will run on all available GPUs.
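To restrict it to specific devices, you can set the standard CUDA_VISIBLE_DEVICES environment variable (a CUDA feature, not a QUIK flag), for example:

CUDA_VISIBLE_DEVICES=0,1 python llama.py ...  # same arguments as above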

Linear layer benchmarks

Linear layer benchmarks can be run with python layer_benchmark.py. The input sizes can be varied via command-line parameters.
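The individual parameters are not listed here; assuming the script exposes a standard argparse interface, they can be discovered with:

python layer_benchmark.py --help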

Adapting a model to QUIK

First, quantize the model weights using the GPTQ algorithm; in llama.py this is done by the llama_sequential function, which produces quantized weights that are still stored in torch.float16. Then create QUIK Linear layers with qlinear.MixedQLinear.from_float and replace the original Linear layers with them (see llama_replace_with_kernels in llama.py); a sketch of this step follows below. The quantized model is then ready for use.
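The following is a minimal sketch of the replacement step, assuming a model whose weights were already quantized with GPTQ. The import path and the exact arguments of MixedQLinear.from_float are assumptions here; refer to llama_replace_with_kernels in llama.py for the real call.

import torch.nn as nn

# Assumed import path; llama.py shows how qlinear is actually imported.
from quik import qlinear

def replace_linears_with_quik(module: nn.Module) -> nn.Module:
    # Recursively swap every nn.Linear for a QUIK mixed-precision layer,
    # mirroring what llama_replace_with_kernels does for the Llama blocks.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            # from_float consumes the GPTQ-quantized layer (weights stored in
            # torch.float16). Its real signature may take further arguments,
            # e.g. which outlier features stay in full precision; treat this
            # call as illustrative only.
            setattr(module, name, qlinear.MixedQLinear.from_float(child))
        else:
            replace_linears_with_quik(child)
    return module

# Usage: model = replace_linears_with_quik(model)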

Fake Quantization examples

To run the fake quantization example, check the fake_quant directory.

Citation

The full paper is available on arXiv. The full citation is:

@article{QUIK,
  title={QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models},
  author={Ashkboos, Saleh and Markov, Ilia and Frantar, Elias and Zhong, Tingxuan and Wang, Xincheng and Ren, Jie and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2310.09259},
  year={2023}
}
