
VPTQ-community


Organization Card

Disclaimer:

VPTQ-community is an open-source community that reproduces models from the paper VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models (GitHub).

It is intended only for experimental purposes.

Users are responsible for any consequences arising from the use of this model.

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

TL;DR

Vector Post-Training Quantization (VPTQ) is a novel post-training quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at extremely low bit-widths (<2 bits). VPTQ can compress 70B, and even 405B, models to 1-2 bits without retraining while maintaining high accuracy.

  • Better accuracy at 1-2 bits
  • Lightweight quantization algorithm: quantizing the 405B Llama-3.1 takes only ~17 hours
  • Agile quantization inference: low decode overhead, high throughput, and low time to first token (TTFT)

Example: Run Llama 3.1 70B on an RTX 4090 (24 GB, ~2 bits) in real time

Tech Report

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
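To make the lookup-table idea concrete, here is a minimal, illustrative sketch of weight-only vector quantization (not the VPTQ algorithm itself, and the codebook here is random rather than learned): weights are split into short vectors, each vector is replaced by the index of its nearest centroid in a codebook, and dequantization is a single table lookup.

```python
import numpy as np

def vq_quantize(weights, codebook):
    """Map each row vector of `weights` to the index of its nearest centroid."""
    # Squared Euclidean distance from every weight vector to every centroid.
    dists = ((weights[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # one small integer index per vector

def vq_dequantize(indices, codebook):
    """Reconstruct the weight matrix by looking each index up in the codebook."""
    return codebook[indices]

rng = np.random.default_rng(0)
v, k = 8, 256                       # vector length 8, 2^8 centroids
w = rng.standard_normal((1024, v))  # 1024 weight vectors
cb = rng.standard_normal((k, v))    # a (here random) codebook
idx = vq_quantize(w, cb)            # stored at log2(256)/8 = 1 bit per weight
w_hat = vq_dequantize(idx, cb)      # approximate reconstruction of w
```

Only the indices (and the small shared codebook) need to be stored, which is how per-weight cost drops below 2 bits; VPTQ additionally uses residual codebooks and Hessian-aware centroid optimization on top of this basic scheme.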

Read the tech report: Tech Report and arXiv Paper

Models from Open Source Community

⚠️ The repository provides only the model quantization algorithm.

⚠️ The open-source community VPTQ-community provides models based on the technical report and the quantization algorithm.

⚠️ The repository cannot guarantee the performance of those models.

Quick Estimation of Model Bitwidth (Excluding Codebook Overhead):

  • Model Naming Convention: The model's name encodes the vector length $v$, the codebook (lookup table) size, and the residual codebook size. For example, "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft" is quantized from "Meta-Llama-3.1-70B-Instruct", where:

    • Vector Length: 8
    • Number of Centroids: 65536 (2^16)
    • Number of Residual Centroids: 256 (2^8)
  • Equivalent Bitwidth Calculation:

    • Index: log2(65536) = 16 bits per vector, 16 / 8 = 2 bits per weight
    • Residual Index: log2(256) = 8 bits per vector, 8 / 8 = 1 bit per weight
    • Total Bitwidth: 2 + 1 = 3 bits per weight
  • Model Size Estimation: 70B * 3 bits / 8 bits per Byte = 26.25 GB

  • Note: This estimate does not include the size of the codebook (lookup table), other parameter overheads, and the padding overhead for storing indices. For the detailed calculation method, please refer toTech Report Appendix C.2.
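The estimate above can be sketched as a short script. This is a hedged illustration, not an official tool: the `estimate_bits` parser assumes the `-v{v}-k{K}-{Kres}-` naming pattern described above holds for every model, and the size figure covers packed indices only, excluding codebook, other parameters, and padding overhead.

```python
import math
import re

def estimate_bits(name):
    """Equivalent bits per weight from a VPTQ-style model name.

    Hypothetical parser: assumes the '-v8-k65536-256-' pattern above,
    with a residual codebook size of 0 meaning "no residual codebook".
    """
    m = re.search(r"-v(\d+)-k(\d+)-(\d+)-", name)
    v, k, k_res = map(int, m.groups())
    bits = math.log2(k) / v        # main index bits, amortized per weight
    if k_res > 1:                  # add residual index bits, if present
        bits += math.log2(k_res) / v
    return bits

name = "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft"
bits = estimate_bits(name)         # log2(65536)/8 + log2(256)/8 = 3.0
size_gb = 70e9 * bits / 8 / 1e9    # 70B weights * 3 bits / 8 = 26.25 GB
```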

| Model Series | Collections | (Estimated) Bits per Weight |
| --- | --- | --- |
| DeepSeek R1 | HF 🤗 | mixed 2-3 bits (reshard for 4 GPUs), 2 bits (reshard for 4 GPUs), 3 bits (reshard for 4 GPUs), 2 bits, 3 bits |
| Llama 3.3 70B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.625 bits |
| Llama 3.1 Nemotron 70B Instruct HF | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.625 bits, 1.5 bits |
| Llama 3.1 8B Instruct | HF 🤗 | 4 bits, 3.5 bits, 3 bits, 2.3 bits |
| Llama 3.1 70B Instruct | HF 🤗 | 4 bits, 3 bits, 2.25 bits, 2 bits (1), 2 bits (2), 1.93 bits, 1.875 bits, 1.75 bits |
| Llama 3.1 405B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits, 1.875 bits, 1.625 bits, 1.5 bits (1), 1.5 bits (2), 1.43 bits, 1.375 bits |
| Mistral Large Instruct 2407 (123B) | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.75 bits, 1.625 bits, 1.5 bits |
| Qwen 2.5 7B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3) |
| Qwen 2.5 14B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3) |
| Qwen 2.5 32B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3) |
| Qwen 2.5 72B Instruct | HF 🤗 | 4 bits, 3 bits, 2.38 bits, 2.25 bits (1), 2.25 bits (2), 2 bits (1), 2 bits (2), 1.94 bits |
| Reproduced from the tech report | HF 🤗 | Results from the open-source community, for reference only; please use them responsibly. |
| Hessian and Inverse Hessian Matrix | HF 🤗 | Collected from RedPajama-Data-1T-Sample, following QuIP# |

A Space Demo

A live chatbot, the VPTQ-LLM-2bit demo, is built on VPTQ.

