[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
[Paper](https://arxiv.org/abs/2407.21118)
- [2024.11.03]: We updated our arXiv paper with a latency evaluation of the quantization integration. Check it out!
- [2024.08.01]: 🚀 Palu ver. 1 is released.
Palu is a KV-Cache compression framework that uses low-rank projection to compress the hidden dimension of the KV-Cache, reducing memory footprint and increasing speed. Unlike MLA in DeepSeek-V2, which requires large-scale training from scratch, Palu works with existing LLMs such as Llama3 and Mistral in a post-training manner. To achieve this, Palu decomposes the linear layers into low-rank matrices, caches the smaller intermediate states, and reconstructs the full keys and values on the fly. To improve accuracy, compression rate, and efficiency, Palu further encompasses (1) a medium-grained low-rank decomposition scheme, (2) an efficient rank search algorithm, (3) matrix fusion for quantization friendliness, and (4) co-designed GPU kernels.
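The decompose-cache-reconstruct flow described above can be sketched in a few lines of NumPy (an illustrative toy, not the repository's actual implementation; all names here are made up):

```python
import numpy as np

# Toy sketch of Palu's idea: decompose a key projection W_k into A @ B,
# cache the small latent h @ A, and rebuild the full key only when needed.
rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 64, 16

W_k = rng.standard_normal((d_model, d_head))   # original key projection

# Offline, post-training: truncated SVD gives W_k ≈ A @ B
U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
A = U[:, :rank] * S[:rank]                     # (d_model, rank)
B = Vt[:rank]                                  # (rank, d_head)

h = rng.standard_normal((1, d_model))          # one hidden state

# Cache the rank-sized latent instead of the full d_head-sized key...
latent = h @ A                                 # (1, rank)  -> what gets cached
# ...and reconstruct the full key on the fly at attention time.
k_approx = latent @ B                          # (1, d_head)
```

Caching `latent` instead of the full key is where the memory saving comes from: `rank` floats per token instead of `d_head`.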
Our extensive experiments with popular LLMs show that Palu can compress the KV-Cache by more than 91.25% while maintaining significantly better accuracy (up to 1.19 lower perplexity) than state-of-the-art KV-Cache quantization methods at similar or even higher memory usage. For more details, please refer to our paper.
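As a back-of-envelope illustration of how a headline figure like 91.25% can arise when low-rank compression is combined with quantization (the numbers below are hypothetical, not necessarily the paper's exact configuration):

```python
def kv_cache_fraction(rank_ratio: float, bits: int, baseline_bits: int = 16) -> float:
    """Fraction of the fp16 KV-Cache memory that remains after caching
    rank-reduced latents (rank_ratio of the hidden dim) at `bits` bits."""
    return rank_ratio * bits / baseline_bits

# e.g. keeping 35% of the hidden dimension and quantizing latents to 4 bits
frac = kv_cache_fraction(0.35, 4)
print(f"memory kept: {frac:.2%}  ->  reduction: {1 - frac:.2%}")
# 0.35 * 4/16 = 0.0875, i.e. a 91.25% reduction
```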
- Upgrade `transformers>=4.43.3` for Llama 3.1 support
- Update reconstruction kernel, with quantization integrated
- Support FlashAttention or FlashInfer to enhance compatibility
- Clone the repository (make sure you have Git and Conda installed on your system)

```bash
git clone --recurse-submodules https://github.com/shadowpa0327/Palu.git
cd Palu
```
- Prepare environment

```bash
conda create -n Palu python=3.10
conda activate Palu
pip install -r requirements.txt
```
- Install 3rd-party libraries

```bash
pip install -e 3rdparty/lm-evaluation-harness
pip install -e 3rdparty/fast-hadamard-transform
```
We provide a script, `compress.py`, that performs the rank search and low-rank decomposition to generate the low-rank projection matrices for compressing the KV-Cache. Here, as an example, we perform the decomposition with the proposed `G-LRD` method with a group size of 4.
```bash
python compress.py \
    --model_id=/Path/To/Pretrained/Model \
    --calib_dataset wikitext2 \
    --param_ratio_target 0.7 \
    --search_method fisher_uniform \
    --head_group_size 4 \
    --dump_huggingface_model \
    --use_cache
```
After executing the above command, a compressed model with decomposed low-rank projection matrices will be dumped into the `{MODEL_NAME}-ratio-{TARGET_RATIO}_gs-{GROUP_SIZE}-{SEARCH_METHOD}-{DECOMPOSE_METHODS}` directory. The dumped model is stored in the Hugging Face Transformers format.
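Concretely, the directory name is assembled from the compression arguments; a sketch of the naming pattern (the example values and the script's exact string formatting are hypothetical):

```python
# Hypothetical example values; compress.py's exact formatting may differ.
model_name = "Llama-2-7b"
target_ratio, group_size = 0.7, 4
search_method, decompose_method = "fisher_uniform", "G-LRD"

dump_dir = (f"{model_name}-ratio-{target_ratio}_gs-{group_size}"
            f"-{search_method}-{decompose_method}")
print(dump_dir)  # Llama-2-7b-ratio-0.7_gs-4-fisher_uniform-G-LRD
# Since the dump uses the Transformers format, it can then be loaded
# with AutoModelForCausalLM.from_pretrained(dump_dir).
```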
With the compressed model dumped, we can evaluate its performance on various tasks. We provide scripts for perplexity evaluation, zero-shot evaluation, and LongBench. By default, the compressed KV-Cache is kept in fp16.
To evaluate the perplexity of the compressed model on the `wikitext2` dataset with a sequence length of 2048, use the `run_ppl_eval.py` script:
```bash
python run_ppl_eval.py \
    --model_name_or_path /Path/To/Palu/Model \
    --datasets wikitext2 \
    --seqlen 2048
```
To reproduce the evaluation of `c4` perplexity, simply change the `datasets` argument to `c4`.
To evaluate performance with quantization integrated, pass `--lt_bits {num_bits}` and `--lt_hadamard` in the arguments to enable our low-rank-aware quantization. For example, to evaluate Palu with 3-bit low-rank-aware quantization, run:
```bash
python run_ppl_eval.py \
    --model_name_or_path /Path/To/Palu/Model \
    --datasets wikitext2 \
    --seqlen 4096 \
    --lt_bits 3 \
    --lt_hadamard
```
Note: `run_ppl_eval.py` does not support multi-GPU evaluation. If your machine has multiple GPUs, please set `CUDA_VISIBLE_DEVICES` to the desired GPU id.
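For intuition about the `--lt_hadamard` option, here is a toy NumPy experiment (our own illustration, not the repository's code): rotating a vector with an orthonormal Hadamard matrix spreads an outlier channel across all channels, which shrinks the quantization scale and the round-trip error.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix for n a power of two (Sylvester construction)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization; the scale is set by the largest value."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
x[3] = 20.0                                   # one outlier blows up the scale

H = hadamard(64)
err_plain = np.abs(quantize(x, 3) - x).mean()
err_rotated = np.abs(H.T @ quantize(H @ x, 3) - x).mean()
print(err_plain, err_rotated)                 # rotated error is smaller
```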
To run zero-shot evaluations, please use the `run_lm_eval.py` script, which implements a wrapper around the lm-evaluation-harness library. Before starting, please make sure the `lm-eval==0.4.2` library is installed.
To reproduce the results in our paper, simply execute:
```bash
CUDA_VISIBLE_DEVICES=0 python run_lm_eval.py \
    --model_name_or_path /Path/To/Palu/Model \
    --tasks "openbookqa,hellaswag,piqa,arc_easy,arc_challenge,winogrande"
```
We also provide a script to run the LongBench evaluation on the compressed model.
```bash
CUDA_VISIBLE_DEVICES=0 python run_long_bench.py \
    --model_name_or_path /Path/To/Palu/Model
```
The script evaluates on the "triviaqa", "qasper", "trec", "samsum", "lcc", "repobench-p", "qmsum", and "multi_news" datasets by default. Users may also use the `--datasets` argument to specify which tasks to evaluate. For example, add `--datasets "triviaqa,qasper"` to evaluate on the "triviaqa" and "qasper" datasets only.
We provide a script to evaluate the latency of the compressed attention module under different settings. Below is an example demonstrating how to use this script to evaluate the latency of the Palu attention module.
```bash
CUDA_VISIBLE_DEVICES=0 python run_latency_attention.py \
    --rank_k 1024 --rank_v 3072 --group_size 4 \
    --prompt_len 65536 --palu
```
The above command will run the latency evaluation with the following arguments:

- `--rank_k`: Set the rank of the key matrix.
- `--rank_v`: Set the rank of the value matrix.
- `--group_size`: Set the group size used in the low-rank decomposition.
- `--prompt_len`: Set the prompt length.
- `--palu`: Enable the Palu compression technique.
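The grouped reconstruction that `--group_size` controls can be sketched in NumPy (shapes only; the real kernel is a fused GPU implementation, and the names here are illustrative):

```python
import numpy as np

# G-LRD sketch: heads in a group share one latent cache and one
# reconstruction matrix, so a whole group is rebuilt with a single matmul.
rng = np.random.default_rng(0)
num_heads, head_dim, group_size, rank = 8, 32, 4, 64
num_groups = num_heads // group_size
seq_len = 16

# One reconstruction matrix per head group: (rank, group_size * head_dim)
B = rng.standard_normal((num_groups, rank, group_size * head_dim))
# The cached low-rank states: one latent per group and token
latent = rng.standard_normal((num_groups, seq_len, rank))

# Rebuild the full keys for all heads in each group at once
keys = latent @ B            # (num_groups, seq_len, group_size * head_dim)
keys = keys.reshape(num_groups, seq_len, group_size, head_dim)
print(keys.shape)            # (2, 16, 4, 32)
```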
We also provide a script to evaluate the latency of our reconstruction kernel. Below is an example demonstrating how to use this script to evaluate the latency of the Palu reconstruction kernel.
```bash
CUDA_VISIBLE_DEVICES=0 python run_latency_kernel.py \
    --total_rank 1024 --group_size 4
```
If you find this work useful, please consider citing our paper:
```bibtex
@misc{chang2024palucompressingkvcachelowrank,
  title={Palu: Compressing KV-Cache with Low-Rank Projection},
  author={Chi-Chih Chang and Wei-Cheng Lin and Chien-Yu Lin and Chong-Yan Chen and Yu-Fang Hu and Pei-Shuo Wang and Ning-Chi Huang and Luis Ceze and Kai-Chiang Wu},
  year={2024},
  eprint={2407.21118},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2407.21118},
}
```