IST-DASLab/QUIK

Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference (EMNLP 2024).
This repository contains the code for QUIK, a method for quantizing the majority of the weights and activations to 4-bit post-training.
QUIK is described in the following paper: https://arxiv.org/abs/2310.09259
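For intuition only, here is a minimal sketch of the weight-splitting idea behind QUIK: a small set of outlier feature columns (the `--fp_features_num` flag in the experiments below controls how many) is kept in fp16, while the remaining columns are quantized to 4 bits. The per-row symmetric quantizer and the data layout below are illustrative assumptions and do not match the repository's actual kernels.

```python
import torch

def split_and_quantize(W: torch.Tensor, outlier_cols: torch.Tensor):
    """Toy illustration of QUIK-style weight splitting (not the repository's kernels).

    `outlier_cols` is a hypothetical index tensor of feature columns kept in fp16;
    the remaining columns get per-row symmetric 4-bit quantization.
    """
    keep = torch.zeros(W.shape[1], dtype=torch.bool)
    keep[outlier_cols] = True

    W_fp16 = W[:, keep].half()                     # outlier columns stay in full precision
    W_rest = W[:, ~keep]

    scale = W_rest.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    W_int4 = torch.clamp(torch.round(W_rest / scale), -8, 7).to(torch.int8)  # 4-bit range
    return W_fp16, W_int4, scale.half(), keep
```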
Dependencies:

- cmake
- C++ compiler (GCC/clang/...)
- nvcc (CUDA toolkit)
Installation:

```bash
git clone https://github.com/IST-DASLab/QUIK.git
cd QUIK
pip install -e .  # or pip install .
```
To run the model quantization, evaluation, and benchmarking example:

```bash
cd experiments
pip install -r requirements.txt
python llama.py --fp_features_num 256 --model meta-llama/Llama-2-7b-hf --hf_token <your_hf_token> --dataset c4 \
    --w_bits 4 --w_clip --a_bits 4 --save_qmodel_path save_gptq_model_path --int8_down_proj --sim_eval --benchmark
```
The benchmark will be run on all available GPUs.
Linear layer benchmarks can be run with `python layer_benchmark.py`; the input sizes can be varied via command-line parameters.
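If you just want a rough standalone timing of a single linear layer (outside `layer_benchmark.py`), something like the following works; the shapes, iteration counts, and use of a plain fp16 `nn.Linear` are arbitrary assumptions for illustration.

```python
import torch

def bench_linear(in_features=4096, out_features=4096, batch=16, iters=100):
    # Standalone illustrative timing loop; not experiments/layer_benchmark.py.
    layer = torch.nn.Linear(in_features, out_features, bias=False).half().cuda()
    x = torch.randn(batch, in_features, dtype=torch.float16, device="cuda")

    with torch.no_grad():
        for _ in range(10):                        # warm-up
            layer(x)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            layer(x)
        end.record()
    torch.cuda.synchronize()
    print(f"{start.elapsed_time(end) / iters:.3f} ms per forward pass")

bench_linear()
```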
To run a model with the QUIK kernels: first, one has to quantize the model weights using the GPTQ algorithm. In `llama.py` this is done with the `llama_sequential` function, which produces quantized weights (still stored in `torch.float16`). Then one needs to create QUIK Linear layers with `qlinear.MixedQLinear.from_float` and use them to replace the original Linear layers; see `llama_replace_with_kernels` in `llama.py`. After that, the quantized model is ready for use. A condensed sketch of the replacement step follows.
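The sketch below assumes a particular package/module path and particular arguments for `MixedQLinear.from_float`, which may not match the real signature; `llama_replace_with_kernels` in `experiments/llama.py` is the authoritative reference.

```python
import torch.nn as nn
from quik import qlinear  # module path assumed; match the repository's own imports

def replace_linears(model: nn.Module, quantized_weights: dict, fp_indices: dict):
    """Illustrative sketch: swap nn.Linear modules for QUIK mixed-precision layers.

    The from_float arguments used here are assumptions; follow
    llama_replace_with_kernels in experiments/llama.py for the real call.
    """
    for parent_name, parent in model.named_modules():
        for child_name, child in parent.named_children():
            full_name = f"{parent_name}.{child_name}" if parent_name else child_name
            if isinstance(child, nn.Linear) and full_name in quantized_weights:
                new_layer = qlinear.MixedQLinear.from_float(
                    child,
                    quantized_weights[full_name],  # GPTQ-quantized weights (fp16 storage)
                    fp_indices.get(full_name),     # outlier feature indices kept in fp16
                )
                setattr(parent, child_name, new_layer)
    return model
```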
To run the fake quantization example, check the `fake_quant` directory.
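Fake (simulated) quantization rounds tensors to the low-bit grid and immediately maps them back to floating point, so accuracy can be checked without the custom CUDA kernels. A minimal per-tensor example, independent of the actual `fake_quant` code, is:

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Illustrative symmetric per-tensor fake quantization; fake_quant uses its own quantizers.
    qmax = 2 ** (bits - 1) - 1                                 # 7 for 4 bits
    scale = x.abs().amax().clamp_min(1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
```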
The full paper is available on arXiv. The full citation is:
```bibtex
@article{QUIK,
  title={QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models},
  author={Ashkboos, Saleh and Markov, Ilia and Frantar, Elias and Zhong, Tingxuan and Wang, Xincheng and Ren, Jie and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2310.09259},
  year={2023}
}
```