VPTQ, A Flexible and Extreme low-bit quantization algorithm


microsoft/VPTQ


TL;DR

Vector Post-Training Quantization (VPTQ) is a novel post-training quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2 bits). VPTQ can compress 70B, and even 405B, models to 1-2 bits without retraining while maintaining high accuracy.

  • Better accuracy at 1-2 bits (405B @ <2 bits, 70B @ 2 bits)
  • Lightweight quantization algorithm: only ~17 hours to quantize 405B Llama-3.1
  • Agile quantized inference: low decode overhead, best throughput, and TTFT

News

  • [2025-03-12] VPTQ now provides preliminary support for inference with DeepSeek R1! With our quantized models, you can efficiently run DeepSeek R1 on A100 GPUs. Tutorial, Inference Code
  • [2025-02-27] aphrodite-engine is the first inference engine to support VPTQ. Thanks to @AlpinDale for the contribution!
  • [2025-01-18] VPTQ v0.0.5 released, featuring cmake support and an enhanced build pipeline!
  • [2025-01-13] VPTQ is formally supported by Transformers in its wheel package releases since v4.48.0.
  • [2024-12-20] 🚀 VPTQ ❤️ Hugging Face Transformers: VPTQ support has been merged into the Hugging Face Transformers main branch! Check out the commit and our Colab example: VPTQ in Colab
  • [2024-12-15] 🌐 Open-source community contributes Meta Llama 3.3 70B @ 1-4 bits models
  • [2024-11-01] 📦 VPTQ is now available on PyPI! You can install it easily using the command: pip install vptq.
  • [2024-10-28] ✨ VPTQ algorithm early-released at the algorithm branch; check out the tutorial.
  • [2024-10-22] 🌐 Open-source community contributes Meta Llama 3.1 Nemotron 70B models; check how VPTQ counts 'r' on a local GPU. We are continuing to work on quantizing the 4-6 bit versions. Please stay tuned!
  • [2024-10-21] 🌐 Open-source community contributes Meta Llama 3.1 405B @ 3/4 bits models
  • [2024-10-18] 🌐 Open-source community contributes Mistral Large Instruct 2407 (123B) models
  • [2024-10-14] 🚀 Added early ROCm support.
  • [2024-10-06] 🚀 Try VPTQ on Google Colab: VPTQ in Colab
  • [2024-10-05] 🚀 Added a free Hugging Face demo: Huggingface Demo
  • [2024-10-04] ✏️ Updated the VPTQ tech report and fixed typos.
  • [2024-09-20] 🌐 Inference code is now open-sourced on GitHub. Join us and contribute!
  • [2024-09-20] 🎉 The VPTQ paper has been accepted to the main track at EMNLP 2024.

Installation

Dependencies

  • CUDA toolkit
  • Python 3.10+
  • torch >= 2.3.0
  • transformers >= 4.44.0
  • accelerate >= 0.33.0
  • flash_attn >= 2.5.0
  • datasets (latest)
  • cmake >= 3.18.0

Install VPTQ on Your Machine

Recommendation: To save time on building the package, please install VPTQ directly from the latest release on PyPI:

pip install vptq

or from:

https://github.com/microsoft/VPTQ/releases

Install from the GitHub Repository

To install the latest version directly from the GitHub repository, run:

pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation

Build from Source

If a release package is not available, you can build the package from the source code:

NOTE: Ensure Python dependencies and the CUDA toolkit are installed, and that nvcc is available in your PATH.

  1. Build the wheel:

    python setup.py build bdist_wheel
    # Install the built wheel; replace {version} with the version number
    pip install dist/vptq-{version}.whl
  2. Clean the build:

    python setup.py clean

Example: Run Llama 3.1 70B on an RTX 4090 (24 GB @ ~2 bits) in real time

[Demo: Llama 3.1 70B prompt]


VPTQ is an ongoing project. If the open-source community is interested in optimizing and expanding VPTQ, please feel free to submit an issue or send a DM.


Evaluation

Models from Open Source Community

⚠️ This repository only provides the model quantization algorithm.

⚠️ The open-source community VPTQ-community provides models based on the technical report and quantization algorithm.

⚠️ This repository cannot guarantee the performance of those models.

Quick Estimation of Model Bitwidth (Excluding Codebook Overhead):

  • Model Naming Convention: The model's name encodes the vector length $v$, the codebook (lookup table) size, and the residual codebook size. For example, "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft" is "Meta-Llama-3.1-70B-Instruct" quantized with:

    • Vector Length: 8
    • Number of Centroids: 65536 (2^16)
    • Number of Residual Centroids: 256 (2^8)
  • Equivalent Bitwidth Calculation (see the sketch after this list):

    • Index: log2(65536) = 16 bits per 8-element vector, i.e. 16 / 8 = 2 bits per weight
    • Residual Index: log2(256) = 8 bits per 8-element vector, i.e. 8 / 8 = 1 bit per weight
    • Total Bitwidth: 2 + 1 = 3 bits per weight
  • Model Size Estimation: 70B * 3 bits / 8 bits per byte = 26.25 GB

  • Note: This estimate does not include the size of the codebook (lookup table), other parameter overheads, or the padding overhead for storing indices. For the detailed calculation method, please refer to Tech Report Appendix C.2.
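To make the arithmetic above concrete, here is a small, hypothetical helper (not part of the VPTQ package; the function name and regex are illustrative) that parses the v/k/residual fields from a community model name and reproduces the 3-bit / 26.25 GB estimate:

import math
import re

def estimate_bitwidth(model_name: str, num_params_b: float) -> None:
    # Parse the "-v{vector_len}-k{centroids}-{residual_centroids}-" fields
    # from names like "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft".
    m = re.search(r"-v(\d+)-k(\d+)-(\d+)-", model_name)
    if m is None:
        raise ValueError("model name does not follow the v/k/residual convention")
    vector_len, centroids, residual_centroids = (int(g) for g in m.groups())

    index_bits = math.log2(centroids) / vector_len               # bits/weight for the main index
    residual_bits = (
        math.log2(residual_centroids) / vector_len if residual_centroids > 1 else 0.0
    )                                                            # bits/weight for the residual index
    total_bits = index_bits + residual_bits

    # Bits -> bytes; excludes codebook, other parameters, and index padding overhead.
    size_gb = num_params_b * 1e9 * total_bits / 8 / 1e9
    print(f"~{total_bits:.2f} bits/weight, ~{size_gb:.2f} GB (codebook overhead excluded)")

# Example: reproduces 2 + 1 = 3 bits/weight and 26.25 GB for the 70B model above.
estimate_bitwidth("Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft", 70)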

Model Series | Collections | (Estimated) Bit per weight
DeepSeek R1 | HF 🤗 | 2.x bits (reshard for 4 GPUs), 2 bits (reshard for 4 GPUs), 3 bits (reshard for 4 GPUs), 3 bits, 2 bits
Llama 3.3 70B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.625 bits
Llama 3.1 Nemotron 70B Instruct HF | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.625 bits, 1.5 bits
Llama 3.1 8B Instruct | HF 🤗 | 4 bits, 3.5 bits, 3 bits, 2.3 bits
Llama 3.1 70B Instruct | HF 🤗 | 4 bits, 3 bits, 2.25 bits, 2 bits (1), 2 bits (2), 1.93 bits, 1.875 bits, 1.75 bits
Llama 3.1 405B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits, 1.875 bits, 1.625 bits, 1.5 bits (1), 1.5 bits (2), 1.43 bits, 1.375 bits
Mistral Large Instruct 2407 (123B) | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.75 bits, 1.625 bits, 1.5 bits
Qwen 2.5 7B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3)
Qwen 2.5 14B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3)
Qwen 2.5 32B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3)
Qwen 2.5 72B Instruct | HF 🤗 | 4 bits, 3 bits, 2.38 bits, 2.25 bits (1), 2.25 bits (2), 2 bits (1), 2 bits (2), 1.94 bits
Reproduced from the tech report | HF 🤗 | Results from the open-source community, for reference only; please use them responsibly.
Hessian and Inverse Hessian Matrix | HF 🤗 | Collected from RedPajama-Data-1T-Sample, following Quip#

Language Generation Example

To generate text using the pre-trained model, you can use the following command:

The model VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft (~2 bits) is provided by the open-source community. The repository cannot guarantee the performance of those models.

python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"

[Demo: Llama 3.1 70B prompt]

Terminal Chatbot Example

Launching a chatbot (note that you must use a chat model for this to work):

python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --chat

[Demo: Llama 3.1 70B chat]

Hugging Face Transformers API Example

The Hugging Face Transformers main branch now supports VPTQ:

#! pip install transformers>=4.48.0 -U
#! pip install vptq -U

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "VPTQ-community/Meta-Llama-3.3-70B-Instruct-v16-k65536-65536-woft"

# Load the VPTQ-quantized model directly from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Simple inference
prompt = "Explain: Do not go gentle into that good night."
output = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device))
print(tokenizer.decode(output[0], skip_special_tokens=True))

VPTQ Python API Example

Using the Python API from the VPTQ package:

import vptq
import transformers

model_name = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
m = vptq.AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Explain: Do Not Go Gentle into That Good Night"
out = m.generate(**tokenizer(prompt, return_tensors="pt").to("cuda"), max_new_tokens=100, pad_token_id=2)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Gradio Web App Example

An environment variable is available to control whether a public share link is created:

export SHARE_LINK=1

python -m vptq.app

VPTQ Algorithm Early-released

The VPTQ algorithm is early-released on the algorithm branch; check out the tutorial.

Tech Report

[Figure: VPTQ tech report overview]

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
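To make the mechanism concrete, below is a toy, self-contained sketch of plain vector quantization of a weight matrix. It is illustrative only and is not the VPTQ algorithm (which additionally uses second-order, Hessian-weighted optimization and residual codebooks as described in the report): weight vectors are mapped to codebook indices, and dequantization is a table lookup.

import numpy as np

# Toy vector quantization: group weights into length-v vectors, assign each
# vector the index of its nearest codebook centroid, and reconstruct by lookup.
rng = np.random.default_rng(0)
v, k = 8, 256                      # vector length, number of centroids (8 bits / 8 weights = 1 bit/weight)
W = rng.standard_normal((256, 256)).astype(np.float32)

vectors = W.reshape(-1, v)                                       # (num_vectors, v)
codebook = vectors[rng.choice(len(vectors), k, replace=False)]   # toy centroids (k-means in practice)

# Nearest-centroid assignment: the per-vector indices (plus the codebook) are all that is stored.
d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = d.argmin(axis=1).astype(np.uint8)

# Dequantization is just a table lookup.
W_hat = codebook[indices].reshape(W.shape)
print(f"{np.log2(k) / v:.2f} bits/weight, MSE = {((W - W_hat) ** 2).mean():.4f}")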

Read the tech report at Tech Report and the arXiv Paper.

Early Results from Tech Report

VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes under reasonable parameters, especially in terms of model accuracy and inference speed.

Model | bitwidth | W2↓ | C4↓ | AvgQA↑ | tok/s↑ | mem (GB) | cost/h↓
LLaMA-2 7B | 2.02 | 6.13 | 8.07 | 58.2 | 39.9 | 2.28 | 2
LLaMA-2 7B | 2.26 | 5.95 | 7.87 | 59.4 | 35.7 | 2.48 | 3.1
LLaMA-2 13B | 2.02 | 5.32 | 7.15 | 62.4 | 26.9 | 4.03 | 3.2
LLaMA-2 13B | 2.18 | 5.28 | 7.04 | 63.1 | 18.5 | 4.31 | 3.6
LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7 | 19.54 | 19
LLaMA-2 70B | 2.11 | 3.92 | 5.71 | 68.7 | 9.7 | 20.01 | 19

Road Map

  • Merge the quantization algorithm into the public repository.
  • Release on PyPI
  • Improve the implementation of the inference kernel (e.g., CUDA, ROCm, Triton) and apply kernel fusion by combining dequantization (lookup) and Linear (GEMM) to enhance inference performance (see the sketch after this list).
  • Support VLM models @YangWang92
  • Contribute VPTQ to Hugging Face Transformers (commit)
  • Contribute VPTQ to vLLM, LLM Compressor
  • Contribute VPTQ to llama.cpp/exllama.
  • Contribute VPTQ to edge-device deployment.
  • TBC
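For intuition on the kernel-fusion item above, the sketch below contrasts the unfused flow (materialize the full weight matrix via codebook lookup, then run the GEMM) with a readable reference of what a fused kernel aims for (look up codebook vectors on the fly and never store the dequantized weights). This is a conceptual PyTorch illustration with assumed shapes and hypothetical function names, not the VPTQ kernels.

import torch

def dequant_then_gemm(x, codebook, indices, out_features, in_features):
    # Unfused baseline: materialize the full weight matrix, then matmul.
    # codebook: (k, v); indices: one codebook index per length-v weight vector.
    W = codebook[indices].reshape(out_features, in_features)   # lookup -> (out, in)
    return x @ W.t()

def fused_reference(x, codebook, indices, out_features, in_features):
    # What a fused kernel aims for, written as a slow but readable reference:
    # process one length-v block at a time, looking up its centroids on demand.
    v = codebook.shape[1]
    idx = indices.reshape(out_features, in_features // v)      # (out, in/v)
    y = x.new_zeros(x.shape[0], out_features)
    for j in range(in_features // v):
        W_block = codebook[idx[:, j]]                          # (out, v), never stored as full W
        y += x[:, j * v:(j + 1) * v] @ W_block.t()
    return y

# Tiny example with assumed shapes; both paths give the same result.
v, k, out_f, in_f = 8, 256, 64, 128
codebook = torch.randn(k, v)
indices = torch.randint(k, (out_f * in_f // v,))
x = torch.randn(4, in_f)
assert torch.allclose(dequant_then_gemm(x, codebook, indices, out_f, in_f),
                      fused_reference(x, codebook, indices, out_f, in_f), atol=1e-4)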

Project main members:

  • Yifei Liu (@lyf-00)
  • Jicheng Wen (@wejoncy)
  • Ying Cao (@lcy-seso)
  • Yang Wang (@YangWang92)

Acknowledgement

  • We thank James Hensman for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research.
  • We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.

Publication

EMNLP 2024 Main

@inproceedings{vptq,
  title={VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models},
  author={Yifei Liu and Jicheng Wen and Yang Wang and Shengyu Ye and Li Lyna Zhang and Ting Cao and Cheng Li and Mao Yang},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024},
}

Star History

Star History Chart


Limitation of VPTQ

  • ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before you use it.
  • ⚠️ The repository only provides the model quantization algorithm. The open-source community may provide models based on the technical report and quantization algorithm, but the repository cannot guarantee the performance of those models.
  • ⚠️ VPTQ has not been tested on all potential applications and domains, and we cannot guarantee its accuracy and effectiveness across other tasks or scenarios.
  • ⚠️ Our tests are all based on English texts; other languages are not included in the current testing.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

