VPTQ, A Flexible and Extreme low-bit quantization algorithm
microsoft/VPTQ
Efficient and flexible LLM compression to less than 2 bits
- TL;DR
- News
- Installation
- Evaluation
- Tech Report
- Road Map
- Project main members:
- Acknowledgement
- Publication
- Star History
- Limitation of VPTQ
- Contributing
- Trademarks
Vector Post-Training Quantization (VPTQ) is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and maintain high accuracy.
- Better Accuracy on 1-2 bits, (405B @ <2bit, 70B @ 2bit)
- Lightweight Quantization Algorithm: costs only ~17 hours to quantize the 405B Llama-3.1 model
- Agile Quantization Inference: low decode overhead, best throughput, and TTFT
- [2024-11-01] 📦 VPTQ is now available on PyPI! You can install it easily using the command: `pip install vptq`.
- [2024-10-28] ✨ VPTQ algorithm early-released at the algorithm branch; check out the tutorial.
- [2024-10-22] 🌐 Open source community contributes Meta Llama 3.1 Nemotron 70B models, check how VPTQ counts 'r' on local GPU. We are continuing to work on quantizing the 4-6 bit versions. Please stay tuned!
- [2024-10-21] 🌐 Open source community contributes Meta Llama 3.1 405B @ 3/4 bits models
- [2024-10-18] 🌐 Open source community contributes Mistral Large Instruct 2407 (123B) models
- [2024-10-14] 🚀 Add early ROCm support.
- [2024-10-06] 🚀 Try VPTQ on Google Colab.
- [2024-10-05] 🚀 Add free Huggingface Demo: Huggingface Demo
- [2024-10-04] ✏️ Updated the VPTQ tech report and fixed typos.
- [2024-09-20] 🌐 Inference code is now open-sourced on GitHub—join us and contribute!
- [2024-09-20] 🎉 VPTQ paper has been accepted for the main track at EMNLP 2024.
- python 3.10+
- torch >= 2.2.0
- transformers >= 4.44.0
- Accelerate >= 0.33.0
- flash_attn >= 2.5.0
- latest datasets
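The requirements above can be checked before installing. The following is a minimal sketch that uses only the Python standard library; `REQUIREMENTS`, `parse_version`, and `check_requirements` are hypothetical helper names that are not part of VPTQ, and the distribution names are assumed to match the list above.

```python
import sys
from importlib import metadata

# Minimum versions copied from the list above; "datasets" has no stated minimum.
# The distribution names are an assumption and may differ on your system.
REQUIREMENTS = {
    "torch": "2.2.0",
    "transformers": "4.44.0",
    "accelerate": "0.33.0",
    "flash_attn": "2.5.0",
    "datasets": None,
}


def parse_version(text):
    """Turn '2.2.0+cu121' into (2, 2, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad so that '4.44' compares equal to '4.44.0'
        parts.append(0)
    return tuple(parts)


def check_requirements(requirements=REQUIREMENTS):
    """Return {name: status} without importing the packages themselves."""
    report = {"python": "ok" if sys.version_info >= (3, 10) else "too old"}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if minimum is None or parse_version(installed) >= parse_version(minimum):
            report[name] = f"ok ({installed})"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

The check reads installed package metadata only, so it does not load torch or initialize a GPU.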
Recommended: to save the time of building the package, install VPTQ directly from the latest release:
pip install vptq
or from
https://github.com/microsoft/VPTQ/releases
[Not needed if you install the release package]
Preparation steps that might be needed: Set up CUDA_HOME and PATH.
Set `cuda-12` to your own CUDA version and environment. Run `nvcc --version` to find out your version, and `which nvcc` to check your CUDA PATH.
    # example
    export CUDA_HOME=/usr/local/cuda-12
    export PATH=/usr/local/cuda-12/bin/:$PATH
    # set dependent on your environment
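The same checks can be run from Python before starting a source build. This is a rough sketch, not part of VPTQ: `cuda_build_env` is a hypothetical helper that only reports what `CUDA_HOME` and `PATH` currently point to, mirroring the `which nvcc` step above.

```python
import os
import shutil


def cuda_build_env(environ=None):
    """Summarize the CUDA-related settings a source build would see."""
    env = os.environ if environ is None else environ
    cuda_home = env.get("CUDA_HOME")
    # Equivalent of `which nvcc`, restricted to the PATH of the given environment.
    nvcc = shutil.which("nvcc", path=env.get("PATH", ""))
    under_home = False
    if cuda_home and nvcc:
        home = os.path.realpath(cuda_home)
        under_home = os.path.realpath(nvcc).startswith(home + os.sep)
    return {
        "CUDA_HOME": cuda_home,
        "nvcc": nvcc,
        "nvcc_under_cuda_home": under_home,
    }


if __name__ == "__main__":
    for key, value in cuda_build_env().items():
        print(f"{key}: {value}")
```

If `nvcc` is `None` or `nvcc_under_cuda_home` is `False`, the two variables probably point at different CUDA installations, or CUDA is not on the PATH at all.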
Compiling the CUDA kernels takes several minutes; please be patient. To reduce compilation time, the current build targets only SM 7.0, 7.5, 8.0, 8.6, and 9.0. You can set `TORCH_CUDA_ARCH_LIST` to your specific architecture.
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
You can configure the required CUDA architectures and the number of nvcc compile threads by setting
TORCH_CUDA_ARCH_LIST=8.0,9.0 NVCC_THREADS=16 pip install -e . --no-build-isolation
to reduce compilation time.
Example: Run Llama 3.1 70b on RTX4090 (24G @ ~2bits) in real time
VPTQ is an ongoing project. If the open-source community is interested in optimizing and expanding VPTQ, please feel free to submit an issue or DM.
Quick Estimation of Model Bitwidth (Excluding Codebook Overhead):
Model Naming Convention: The model's name includes the vector length $v$, codebook (lookup table) size, and residual codebook size. For example, "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft" is "Meta-Llama-3.1-70B-Instruct", where:
- Vector Length: 8
- Number of Centroids: 65536 (2^16)
- Number of Residual Centroids: 256 (2^8)
Equivalent Bitwidth Calculation:
- Index: log2(65536) / 8 = 16 / 8 = 2 bits
- Residual Index: log2(256) / 8 = 8 / 8 = 1 bit
- Total Bitwidth: 2 + 1 = 3 bits
Model Size Estimation: 70B * 3 bits / 8 bits per Byte = 26.25 GB
Note: This estimate does not include the size of the codebook (lookup table), other parameter overheads, and the padding overhead for storing indices. For the detailed calculation method, please refer to Tech Report Appendix C.2.
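The quick estimate above can be written as a small script. This is a sketch under the naming convention described here: `estimate_bitwidth` and `estimate_size_gb` are hypothetical helpers, not part of the `vptq` package, and a residual codebook size of 0 is assumed to mean that no residual codebook is used. Like the estimate above, it ignores the codebook and padding overheads.

```python
import math
import re

# Matches the "-v<vector length>-k<centroids>-<residual centroids>" part of a name.
NAME_PATTERN = re.compile(r"-v(?P<v>\d+)-k(?P<k>\d+)-(?P<res>\d+)")


def estimate_bitwidth(model_name):
    """Estimated index bits per weight, excluding codebook overhead."""
    match = NAME_PATTERN.search(model_name)
    if match is None:
        raise ValueError(f"unrecognized model name: {model_name}")
    v = int(match["v"])
    k = int(match["k"])
    res = int(match["res"])
    index_bits = math.log2(k) / v
    # Assumption: 0 residual centroids means no residual index is stored.
    residual_bits = math.log2(res) / v if res > 0 else 0.0
    return index_bits + residual_bits


def estimate_size_gb(num_params, bits_per_weight):
    """Model size in GB (10^9 bytes) for a given parameter count."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    name = "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft"
    bits = estimate_bitwidth(name)
    print(f"{name}: {bits} bits, {estimate_size_gb(70e9, bits)} GB")
```

For the example name this reproduces the 3-bit and 26.25 GB figures from the calculation above.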
To generate text using the pre-trained model, you can use the following code snippet:
The model VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft (~2 bit) is provided by the open source community. The repository cannot guarantee the performance of those models.
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"
Launching a chatbot: Note that you must use a chat model for this to work.
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --chat
Using the Python API:
    import vptq
    import transformers

    tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft")
    m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft", device_map='auto')

    inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
    out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
An environment variable is available to control whether a share link is created.
export SHARE_LINK=1
python -m vptq.app
Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
Read the tech report at Tech Report and arXiv Paper
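To make "compressing vectors into indices using lookup tables" concrete, here is a toy sketch of plain vector quantization in pure Python. It is not the VPTQ algorithm: the codebook is fixed by hand rather than optimized, and `quantize` and `dequantize` are hypothetical names used only for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(weights, codebook, v):
    """Split flat weights into length-v vectors and keep only the index
    of the nearest centroid for each vector.
    Assumes len(weights) is a multiple of v."""
    indices = []
    for start in range(0, len(weights), v):
        vector = weights[start:start + v]
        best = min(range(len(codebook)),
                   key=lambda i: squared_distance(vector, codebook[i]))
        indices.append(best)
    return indices


def dequantize(indices, codebook):
    """Rebuild approximate weights by looking each index up in the codebook."""
    restored = []
    for i in indices:
        restored.extend(codebook[i])
    return restored


# Toy codebook: 4 centroids of vector length 2, so log2(4) / 2 = 1 bit per weight.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
weights = [0.9, 1.2, -0.8, 1.1, 0.1, -0.2, 1.1, -0.7]

indices = quantize(weights, codebook, v=2)
restored = dequantize(indices, codebook)
print(indices)   # [1, 2, 0, 3]
print(restored)  # [1.0, 1.0, -1.0, 1.0, 0.0, 0.0, 1.0, -1.0]
```

Only the indices and the small codebook need to be stored, which is where the memory saving comes from; inference then pays for a table lookup during dequantization.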
VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes under reasonable parameters, especially in terms of model accuracy and inference speed.
Model | bitwidth | W2↓ | C4↓ | AvgQA↑ | tok/s↑ | mem(GB) | cost/h↓ |
---|---|---|---|---|---|---|---|
LLaMA-2 7B | 2.02 | 6.13 | 8.07 | 58.2 | 39.9 | 2.28 | 2 |
LLaMA-2 7B | 2.26 | 5.95 | 7.87 | 59.4 | 35.7 | 2.48 | 3.1 |
LLaMA-2 13B | 2.02 | 5.32 | 7.15 | 62.4 | 26.9 | 4.03 | 3.2 |
LLaMA-2 13B | 2.18 | 5.28 | 7.04 | 63.1 | 18.5 | 4.31 | 3.6 |
LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7 | 19.54 | 19 |
LLaMA-2 70B | 2.11 | 3.92 | 5.71 | 68.7 | 9.7 | 20.01 | 19 |
- Merge the quantization algorithm into the public repository.
- Release on PyPI
- Improve the implementation of the inference kernel (e.g., CUDA, ROCm, Triton) and apply kernel fusion by combining dequantization (lookup) and Linear (GEMM) to enhance inference performance.
- Support VLM models @YangWang92
- Contribute VPTQ to Huggingface Transformers
- Contribute VPTQ to vLLM, LLM Compressor
- Contribute VPTQ to llama.cpp/exllama.
- Contribute VPTQ to Edge devices deployment.
- TBC
- Yifei Liu (@lyf-00)
- Jicheng Wen (@wejoncy)
- Yang Wang (@YangWang92)
- We thank James Hensman for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research.
- We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.
EMNLP 2024 Main
    @inproceedings{
        vptq,
        title={VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models},
        author={Yifei Liu and Jicheng Wen and Yang Wang and Shengyu Ye and Li Lyna Zhang and Ting Cao and Cheng Li and Mao Yang},
        booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
        year={2024},
    }
- ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before you use it.
- ⚠️ The repository only provides a method of model quantization algorithm. The open-source community may provide models based on the technical report and quantization algorithm by themselves, but the repository cannot guarantee the performance of those models.
- ⚠️ VPTQ is not capable of testing all potential applications and domains, and VPTQ cannot guarantee the accuracy and effectiveness of VPTQ across other tasks or scenarios.
- ⚠️ Our tests are all based on English texts; other languages are not included in the current testing.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.