Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
IST-DASLab/qmoe
This repository contains the full code of the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
It is organized as follows:

- `datautils.py`: utilities for dataset loading
- `gptq.py`: robust batch implementation of GPTQ
- `quant.py`: quantization utilities
- `sub1.py`: efficient inference of compressed models
- `sub1_cuda_kernel.cu`: CUDA kernels
- `switch.py`: the efficient QMoE compression framework
- `test.py`: per-layer benchmarks and ideal compression rates
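As a rough illustration of where "ideal compression rates" come from (this is not the repository's own computation): a ternary weight carries at most log2(3) ≈ 1.58 bits, but when most weights are zero the per-weight entropy drops well below 1 bit, which is what makes sub-1-bit storage possible. A minimal sketch, assuming a hypothetical 90%-sparse ternary distribution:

```python
import math

def ternary_entropy(p_neg: float, p_zero: float, p_pos: float) -> float:
    """Shannon entropy (bits per weight) of a ternary weight distribution."""
    return -sum(p * math.log2(p) for p in (p_neg, p_zero, p_pos) if p > 0)

# Hypothetical, highly sparse ternary distribution: 90% zeros.
bits = ternary_entropy(0.05, 0.90, 0.05)
ideal_rate = 16 / bits  # ideal compression vs. BF16 (16 bits per weight)
print(f"{bits:.3f} bits/weight -> {ideal_rate:.1f}x ideal compression")
```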
The project was developed with:
- `torch==2.0.0+cu117`
- `transformers==4.28.0`
- `datasets==2.10.1`
- CUDA 11.4 GPU drivers
CUDA kernels for compressed storage and inference can be installed via:
```shell
python setup_cuda.py install
```
Below is a list of sample commands for running different experiments:
```shell
# BF16 baseline eval on C4
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128
# BF16 baseline eval on additional datasets
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --detaileval
# ternary round-to-nearest baseline
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --nearest
# ternary compression with QMoE, saving the compressed model for later inference
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --trainsamples 10000 --save PATH_TO_COMP_MODEL
# 2-bit compression with QMoE
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 2 --trainsamples 10000
# test kernels and compute ideal compression rates
CUDA_VISIBLE_DEVICES=0 python test.py
# run per-layer benchmarks
CUDA_VISIBLE_DEVICES=0 python test.py --benchmark
# run eval of stored compressed model
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --valsamples 128
# run end-to-end benchmark
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128
# run simulated end-to-end benchmark for BF16
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128 --simul
```
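The `--wbits 1.5 --nearest` option above corresponds to a ternary round-to-nearest baseline. A minimal NumPy sketch of the general idea (the repository's `quant.py` may choose scales and thresholds differently):

```python
import numpy as np

def ternary_rtn(w: np.ndarray) -> np.ndarray:
    """Quantize each row of w to {-s, 0, +s} by round-to-nearest."""
    s = np.abs(w).max(axis=-1, keepdims=True)  # simple per-row scale
    q = np.round(w / s)                        # integer levels in {-1, 0, 1}
    return q * s                               # dequantized weights

w = np.array([[0.1, -0.9, 0.4, 0.6]])
print(ternary_rtn(w))  # each value snaps to -0.9, 0.0, or 0.9
```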
In general, you can pass `google/switch-large-128` and `google/switch-c-2048` to run on large-128 and c-2048, respectively. Note that SwitchTransformer models other than these three may not work out of the box due to Hugging Face bugs.

Always specify `CUDA_VISIBLE_DEVICES`, since some commands, like `sub1.py`, will otherwise attempt to use all available GPUs.
Our models in compressed custom QMoE format are available on Hugging Face: base-128, large-128 and c-2048. To use them, clone the respective repository and then simply pass its path to `sub1.py`.
If you found this work useful, please consider citing:
```
@article{frantar-qmoe,
  title={{QMoE}: Practical Sub-1-Bit Compression of Trillion-Parameter Models},
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint, arxiv:2310.16795}
}
```