Vahe1994/SpQR

License

Apache-2.0 license

SpQR model compression

Note: This repository contains the quantization algorithm and the model evaluation code for the SpQR method for LLM compression; the efficient inference code will be added soon.

It accompanies the research paper "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression".

Installation

Packages

To run SpQR with falcon, make sure that you have torch>=2.0.0 with CUDA support.

Install packages from requirements.txt:

pip install -r requirements.txt

Note: the results reported in the arXiv paper were obtained using the 4.28.dev0 version of transformers, commit id 464d420775.

Loading / caching datasets and tokenizer

The script requires downloading and locally caching the relevant tokenizer and datasets. They will be saved in the default Huggingface Datasets directory unless an alternative location is provided by environment variables. See the relevant Datasets documentation section.
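As a minimal sketch of redirecting the cache location: the helper below sets `HF_HOME` and `HF_DATASETS_CACHE`, the environment variables documented by Hugging Face for this purpose. The function name and the example path are hypothetical and not part of this repository.

```python
import os
from pathlib import Path


def set_hf_cache(cache_root: str) -> dict:
    """Point the Hugging Face caches at cache_root (a hypothetical path).

    Call this before importing `datasets` or `transformers`: both libraries
    typically read these variables when they are first imported.
    """
    root = Path(cache_root).expanduser()
    env = {
        "HF_HOME": str(root),
        "HF_DATASETS_CACHE": str(root / "datasets"),
    }
    os.environ.update(env)
    return env


# Example: keep all cached tokenizers and datasets under one folder.
cache_env = set_hf_cache("~/spqr_cache")
print(cache_env["HF_DATASETS_CACHE"])
```

Exporting the same variables in the shell before launching the scripts has the same effect.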

Models

This repository is expected to work with models of the LLaMA, Falcon and OPT families so far.

Data

For quantization with SpQR, it is recommended to use a subset of the data the model was trained on. I.e., for quantization of LLaMA models we recommend using a subset of RedPajama, and for Falcon quantization, RefinedWeb. Both subsets are stored in the data directory:

data/red_pajama_n=1024.pth
data/refined_web_n=128.pth

Note: These subsets are already processed with the corresponding model tokenizer. Using them with a different model will lead to unexpected behavior.

For OPT, following the GPTQ paper, we recommend using c4.
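The recommendations above can be captured in a small lookup, sketched here. It assumes that the dataset names `pajama`, `refinedweb` and `c4` accepted by main.py correspond to the RedPajama, RefinedWeb and C4 data; the helper name and the substring matching on the model name are hypothetical conveniences, not part of this repository.

```python
# Calibration dataset per model family, following the recommendations above.
# Assumption: "pajama" and "refinedweb" select the RedPajama and RefinedWeb data.
RECOMMENDED_CALIBRATION = {
    "llama": "pajama",
    "falcon": "refinedweb",
    "opt": "c4",
}


def recommended_dataset(model_name: str) -> str:
    """Guess the calibration dataset from a model name (simple heuristic)."""
    lowered = model_name.lower()
    for family, dataset in RECOMMENDED_CALIBRATION.items():
        if family in lowered:
            return dataset
    raise ValueError(f"no recommendation for model {model_name!r}")


print(recommended_dataset("tiiuae/falcon-7b"))
```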

W&B logging

For the sake of convenience, one can optionally log the data to the Weights and Biases service (wandb). Run pip install wandb for W&B logging. Specify the $WANDB_ENTITY, $WANDB_PROJECT, $WANDB_NAME environment variables prior to running experiments. Use the --wandb argument to enable logging.
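A launcher script can check the three variables before enabling logging. The sketch below is one way to do that; the function names are hypothetical, and only the variable names and the `--wandb` flag come from the text above.

```python
import os

REQUIRED_WANDB_VARS = ("WANDB_ENTITY", "WANDB_PROJECT", "WANDB_NAME")


def missing_wandb_vars(environ=None) -> list:
    """Return the names of required W&B variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_WANDB_VARS if not environ.get(name)]


def wandb_flag(environ=None) -> list:
    """Return ["--wandb"] when all variables are set, otherwise an empty list."""
    return [] if missing_wandb_vars(environ) else ["--wandb"]


missing = missing_wandb_vars()
if missing:
    print("W&B logging disabled, missing:", ", ".join(missing))
```

The returned list can be appended to the argument list of the launch command.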

Launching

GPU and RAM requirements

This code was developed and tested using a single A100 GPU with 80GB GPU RAM. It may successfully run on GPUs with 32GB+ VRAM for perplexity evaluation of up to LLaMA-65B and Falcon-40B models. With the --offload_activations option, the model perplexity may be evaluated on machines with less VRAM: 24GB+ for LLaMA-65B and 6GB+ for LLaMA-7B. The perplexity testing code also requires enough RAM to hold the uncompressed model weights (e.g. ~130GB for LLaMA-65B) and the testing datasets. For Language Model Evaluation Harness evaluation, one needs enough memory to load the whole model on one or several devices, plus the activation tensors.
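The ~130GB figure follows from simple arithmetic, sketched below under two assumptions that are not stated in the text: weights are held in 16-bit precision (2 bytes per parameter) and 1 GB is counted as 10^9 bytes. The function name is hypothetical.

```python
def uncompressed_weight_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough RAM needed for uncompressed weights, in GB (10**9 bytes).

    Assumes 16-bit weights by default; datasets and activations are not counted.
    """
    return n_params * bytes_per_param / 1e9


# 65 billion parameters at 2 bytes each give about 130 GB.
print(uncompressed_weight_gb(65e9))
```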

Model downloading

The code requires the LLaMA model to be downloaded in Huggingface format and saved locally. The scripts below assume that the $TRANSFORMERS_CACHE variable points to the Huggingface Transformers cache folder.

Perplexity benchmarks:

This script compresses the model and then tests its performance in terms of perplexity using WikiText2, C4, and Penn Treebank datasets.

The command to launch the script should look like this:

export MODEL_PATH=<PATH_TO_MODEL_DIR>
export DATASET=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>

python main.py $MODEL_PATH $DATASET \
    --wbits 4 \
    --groupsize 16 \
    --perchannel \
    --qq_scale_bits 3 \
    --qq_zero_bits 3 \
    --qq_groupsize 16 \
    --outlier_threshold=0.2 \
    --permutation_order act_order \
    --percdamp 1e0 \
    --nsamples 128

The command above runs near-lossless compression as described in the article. Adjusting the above parameters allows for tighter compression with a slightly greater loss.
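When sweeping over compression parameters, it can be convenient to assemble the same command from Python. The sketch below builds the argument list for main.py from keyword options; the helper name and the placeholder model path are hypothetical, while the flag names are those shown in the command above.

```python
import shlex
import sys


def build_main_command(model_path: str, dataset: str, **options) -> list:
    """Build an argv list for main.py.

    True becomes a bare flag (e.g. --perchannel), False and None are skipped,
    and any other value is passed as "--name value".
    """
    cmd = [sys.executable, "main.py", model_path, dataset]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            cmd.append(flag)
        elif value is not False and value is not None:
            cmd.extend([flag, str(value)])
    return cmd


cmd = build_main_command(
    "/path/to/model",  # placeholder path
    "pajama",
    wbits=4, groupsize=16, perchannel=True,
    qq_scale_bits=3, qq_zero_bits=3, qq_groupsize=16,
    outlier_threshold=0.2, permutation_order="act_order",
    percdamp=1.0, nsamples=128,
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would launch the run.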

Note the launch arguments:

<PATH_TO_MODEL_DIR> - path to the model folder, which contains config.json
one of [c4, ptb, wikitext2, pajama, refinedweb, none] -- name of the dataset to use for compression, or path to an alternative preprocessed and tokenized dataset.
--wbits 4 -- number of bits for the quantized weights representation
--groupsize 16 -- size of first-order groups for compression
--qq_groupsize 16 -- size of second-order (quantized) groups for compression
--qq_scale_bits 3 --qq_zero_bits 3 -- bit sizes for quantizing the first-order weights' scales and zeros.
--offload_activations -- moves activations to RAM when not used. Reduces VRAM usage while slowing work by ~10%.
--save --load -- path to save/load the quantized model.

Run python main.py --help for more details on command line arguments, including compression parameters.
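To see how the bit-width and group-size arguments interact, here is a rough lower-bound estimate of storage per weight. It assumes each first-order group of `groupsize` weights stores one quantized scale and one quantized zero, and it deliberately ignores second-order statistics and outliers, so the real average is higher. This is an illustrative calculation, not the accounting used in the paper or in this code; the function name is hypothetical.

```python
def first_order_bits_per_weight(
    wbits: int, groupsize: int, qq_scale_bits: int, qq_zero_bits: int
) -> float:
    """Lower bound on average bits per weight.

    Counts the quantized weight itself plus its share of the group's
    quantized scale and zero. Second-order statistics and outliers
    are not included.
    """
    return wbits + (qq_scale_bits + qq_zero_bits) / groupsize


# 4-bit weights, groups of 16, 3-bit scales and zeros: 4 + 6/16 = 4.375
print(first_order_bits_per_weight(4, 16, 3, 3))
```

Larger groups amortize the scale and zero over more weights, which is why raising `--groupsize` tightens compression.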

LM Evaluation Harness benchmark.

To perform zero-shot evaluation, we use the Language Model Evaluation Harness framework with slight modifications. This repository contains a copy of the LM Evaluation Harness repo from early 2023 in the lm-evaluation-harness folder.

Installation

Before running the code, make sure that you have all the requirements and dependencies of lm-evaluation-harness installed. To install them, run:

pip install -r lm-evaluation-harness/requirements.txt

Execution

The main script launching the evaluation procedure is lmeval.py.

Note: The current version of the script supports only LLaMA/Falcon quantization. Therefore, set:

--model=hf-causal
--model_args pretrained=$MODEL_PATH, where $MODEL_PATH has to be one of the LLaMA models

--quantization_args - list of comma-separated arguments for the quantizer. For details and options, refer to spqr_config.py.

Below is an example of a benchmark launch.

export MODEL_PATH=<INSERT PATH_TO_MODEL_DIR>
export DATASET=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>

python lmeval.py \
    --model hf-causal \
    --model_args pretrained=$MODEL_PATH,dtype=float16,use_accelerate=True \
    --quantization_args dataset=$DATASET,wbits=4,groupsize=16,perchannel=True,qq_scale_bits=3,qq_zero_bits=3,qq_groupsize=16,percdamp=1.0,outlier_threshold=0.2,simplified_outliers=False,nsamples=128,offload_activations=True \
    --tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \
    --batch_size 1
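The `--quantization_args` value is a single comma-separated string of `key=value` pairs, which is easy to mistype by hand. The sketch below builds and splits such a string; both helper names are hypothetical, and the parsing shown here is only an illustration of the format, not the parser used by lmeval.py.

```python
def format_quantization_args(args: dict) -> str:
    """Join a dict into the key=value,key=value form used on the command line."""
    for key, value in args.items():
        if "," in str(value) or "=" in str(key):
            raise ValueError(f"cannot encode {key!r}={value!r}")
    return ",".join(f"{key}={value}" for key, value in args.items())


def split_quantization_args(text: str) -> dict:
    """Split the string back into a dict of raw string values."""
    return dict(item.split("=", 1) for item in text.split(",") if item)


quant = format_quantization_args(
    {"dataset": "pajama", "wbits": 4, "groupsize": 16, "perchannel": True}
)
print(quant)
```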

Performance and runtime notes:

For large models (LLaMA-30B, LLaMA-65B), specify max_memory_per_gpu={value}GIB so that 15-20 GiB of memory stays free on each GPU to store activations for calibration.
offload_activations=True slightly reduces peak memory consumption.
Typically, LLaMA-30B requires 1-2 A100 GPUs with 80GB of memory and LLaMA-65B requires 3 A100 GPUs with 80GB each.
With enough spare GPU memory, one can raise the batch size to accelerate the evaluation process.
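The first note amounts to subtracting a reserve from the memory of each GPU. A minimal sketch of that calculation follows; the function name is hypothetical, and the default reserve of 20 is simply the upper end of the range mentioned above.

```python
def max_memory_per_gpu(total_gib: int, reserve_gib: int = 20) -> str:
    """Format the max_memory_per_gpu setting, leaving reserve_gib free."""
    if not 0 < reserve_gib < total_gib:
        raise ValueError("reserve must be positive and smaller than total memory")
    return f"max_memory_per_gpu={total_gib - reserve_gib}GIB"


# An 80 GiB card with 20 GiB kept free for calibration activations.
print(max_memory_per_gpu(80))
```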

Inference

This repository also contains an efficient CUDA kernel implementation of the SpQR matvec. The file inference_demo.py contains a demo of this functionality by running end-to-end model inference. Below is an example of how to launch it.

usage: inference_demo.py [-h] [--pretrained_model_path PRETRAINED_MODEL_PATH] [--compressed_model_path COMPRESSED_MODEL_PATH] --execution_mode {0,1}

options:
  -h, --help            show this help message and exit
  --pretrained_model_path PRETRAINED_MODEL_PATH
                        Path to the model to the pretrained model
  --compressed_model_path COMPRESSED_MODEL_PATH
                        Path to the compressed .pt model
  --execution_mode {0,1}
                        If set to 0, will evaluate the dense pretrained model. If set to 1, will evaluate the spqr-quantized model

This script also reports the mean and median time of the forward() passes and the total inference execution time.
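For readers who want the same three numbers for their own callable, the measurement can be sketched with the standard library alone. This is not the code used by inference_demo.py; the function name is hypothetical, and for GPU work one would also need to synchronize the device before reading the clock.

```python
import statistics
import time


def time_calls(fn, n_calls: int = 10) -> dict:
    """Call fn() n_calls times and report mean, median and total wall time."""
    durations = []
    total_start = time.perf_counter()
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()  # stand-in for one forward() pass
        durations.append(time.perf_counter() - start)
    total = time.perf_counter() - total_start
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "total": total,
    }


stats = time_calls(lambda: sum(range(10_000)), n_calls=5)
print({name: f"{seconds:.6f}s" for name, seconds in stats.items()})
```

The median is less sensitive than the mean to a slow first call, which is common when kernels are compiled or caches are cold.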

Pre-Requisites for Running the Conversion Scripts, Tests and Benchmarks

In order to run the benchmark and test suite, you need to build the sources used by these scripts. You can do so by running the following command:

/bin/bash scripts/build.sh

which simply runs the setup.py script.

Conversion From Legacy to Optimized SPQR Storage

SpQR produces tensors stored in int8. To run the efficient inference kernels, one must convert the tensors produced by SpQR (legacy tensors) into the optimized storage format used by the CUDA kernel. In order to do so, run the following script:

usage: convert_legacy_model_format.py [-h] --base_model BASE_MODEL --legacy_model_path LEGACY_MODEL_PATH [--sparse_strategy {csr,ptcsr,optimize_latency}] [--save_pt SAVE_PT] [--save_per_layer SAVE_PER_LAYER]

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        path or name of the unquantized model
  --legacy_model_path LEGACY_MODEL_PATH
                        path to legacy model
  --sparse_strategy {csr,ptcsr,optimize_latency}
                        Sparse strategy storage. Options: csr, ptcsr, auto.
                        CSR - Compressed Sparse Rows
                        PTCSR - Alternative storage format
                        optimize_latency - Use the current GPU to determine the optimal storage format to reduce kernel latency
  --save_pt SAVE_PT     Save the converted quantized .pt model here
  --save_per_layer SAVE_PER_LAYER
                        Save the converted quantized m
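As background for the csr option, the sketch below shows the general Compressed Sparse Rows layout in plain Python: nonzero values, their column indices, and per-row offsets into those arrays. It illustrates the textbook format only and makes no claim about the exact tensor layout produced by convert_legacy_model_format.py; the function names are hypothetical.

```python
def dense_to_csr(matrix):
    """Convert a list-of-lists matrix into (values, col_indices, row_offsets)."""
    values, col_indices, row_offsets = [], [], [0]
    for row in matrix:
        for col, value in enumerate(row):
            if value != 0:
                values.append(value)
                col_indices.append(col)
        # Offset where the next row starts in values/col_indices.
        row_offsets.append(len(values))
    return values, col_indices, row_offsets


def csr_matvec(values, col_indices, row_offsets, x):
    """Multiply a CSR matrix by a dense vector x, touching only nonzeros."""
    result = []
    for row in range(len(row_offsets) - 1):
        start, end = row_offsets[row], row_offsets[row + 1]
        result.append(sum(values[i] * x[col_indices[i]] for i in range(start, end)))
    return result


sparse = [[0, 0, 5], [0, 0, 0], [7, 0, 0]]
vals, cols, offsets = dense_to_csr(sparse)
print(vals, cols, offsets)                         # [5, 7] [2, 0] [0, 1, 1, 2]
print(csr_matvec(vals, cols, offsets, [1, 2, 3]))  # [15, 0, 7]
```

Because only nonzero entries are stored and visited, the cost of the product scales with the number of nonzeros rather than with the full matrix size.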

Hugging Face Conversion

To convert a model into a Hugging Face compatible format, use the convert_to_hf.py script:

usage: convert_to_hf.py [-h] [--model MODEL] [--config_path CONFIG_PATH] [--in_path_pt IN_PATH_PT] [--out_path OUT_PATH] [--save_safetensors] [--trust_remote_code] [--load_model] [--save_tokenizer]

options:
  -h, --help            show this help message and exit
  --model MODEL         Path to the model to base config on, as in AutoConfig.from_pretrained()
  --config_path CONFIG_PATH
                        Path to the model to base config on, as in AutoConfig.from_pretrained()
  --in_path_pt IN_PATH_PT
                        Path of the checkpoint to convert
  --out_path OUT_PATH   Path to save HF compatible checkpoint to
  --save_safetensors    Whether to save in safetensors format
  --trust_remote_code   Whether to trust remote code
  --load_model          Whether to load model
  --save_tokenizer      Whether to save tokenizer

Benchmarks (matvec kernel)

In order to run the matvec benchmark suite, one should run:

bench_spqr.py [-h] --tensor_path TENSOR_PATH [--ptcsr_path PTCSR_PATH] [--output_path OUTPUT_PATH]

options:
  -h, --help            show this help message and exit
  --tensor_path TENSOR_PATH
                        Path to folder containing the tensors of the form model_path/ 0/ tensor0 tensor1
  --ptcsr_path PTCSR_PATH
                        Path to folder containing the tensors of the form model_path/ 0/ tensor0 tensor1
  --output_path OUTPUT_PATH
                        Path to results *.csv file.

Make sure that the <tensor_path> and the optional <ptcsr_path> point to a folder containing quantized matrices produced by the convert_legacy_model_format.py script. Use <cuda_device_id> to set the CUDA device during the benchmark. The script outputs the results in <results_output>.

Tests

In order to run the unit tests, simply execute:

python3 tests/test.py

Citation

@misc{dettmers2023spqr,
      title={SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression},
      author={Tim Dettmers and Ruslan Svirschevski and Vage Egiazarian and Denis Kuznedelev and Elias Frantar and Saleh Ashkboos and Alexander Borzunov and Torsten Hoefler and Dan Alistarh},
      year={2023},
      eprint={2306.03078},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
