IST-DASLab/qmoe

Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
This repository contains the full code of the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
It is organized as follows:
- `datautils.py`: utilities for dataset loading
- `gptq.py`: robust batch implementation of GPTQ
- `quant.py`: quantization utilities
- `sub1.py`: efficient inference of compressed models
- `sub1_cuda_kernel.cu`: CUDA kernels
- `switch.py`: the efficient QMoE compression framework
- `test.py`: per-layer benchmarks and ideal compression rates
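The ideal compression rates reported by `test.py` are bounded by the entropy of the quantized weight distribution. As a hedged illustration of why sub-1-bit storage of ternary weights is possible at all (the 90% zero fraction below is a purely illustrative number, not taken from the repository):

```python
import math

def ternary_entropy(p_zero):
    """Shannon entropy in bits/weight of a ternary distribution where
    +1 and -1 share the remaining probability mass equally."""
    p_pm = (1.0 - p_zero) / 2.0
    h = 0.0
    for p in (p_zero, p_pm, p_pm):
        if p > 0:
            h -= p * math.log2(p)
    return h

# Naive ternary storage costs log2(3) ~ 1.585 bits per weight, but if the
# quantized weights are mostly zero (e.g. ~90%, illustrative), the entropy
# drops well below 1 bit, so an entropy coder can reach sub-1-bit rates.
print(ternary_entropy(0.9))
```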
The project was developed with:
- `torch==2.0.0+cu117`
- `transformers==4.28.0`
- `datasets==2.10.1`
- CUDA 11.4 GPU drivers
CUDA kernels for compressed storage and inference can be installed via:
```
python setup_cuda.py install
```
Below is a list of sample commands for running different experiments:
```shell
# BF16 baseline eval on C4
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128
# BF16 baseline eval on additional datasets
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --detaileval
# ternary round-to-nearest baseline
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --nearest
# ternary compression with QMoE, saving the compressed model for later inference
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --trainsamples 10000 --save PATH_TO_COMP_MODEL
# 2-bit compression with QMoE
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 2 --trainsamples 10000
# test kernels and compute ideal compression rates
CUDA_VISIBLE_DEVICES=0 python test.py
# run per-layer benchmarks
CUDA_VISIBLE_DEVICES=0 python test.py --benchmark
# run eval of stored compressed model
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --valsamples 128
# run end-to-end benchmark
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128
# run simulated end-to-end benchmark for BF16
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128 --simul
```
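The `--wbits 1.5 --nearest` baseline rounds each weight independently to a ternary grid. A minimal sketch of the idea in plain Python (the per-row mean-absolute-value scale is an illustrative assumption; `quant.py` implements the actual scheme):

```python
def ternary_rtn(row):
    """Map each weight to {-s, 0, +s} by rounding w/s to the nearest of -1, 0, 1."""
    # per-row scale; mean absolute value is an assumed heuristic for illustration
    s = sum(abs(w) for w in row) / len(row)
    if s == 0:
        return [0.0] * len(row)
    quantized = []
    for w in row:
        level = max(-1, min(1, round(w / s)))  # nearest ternary level, clamped
        quantized.append(level * s)
    return quantized

weights = [0.9, -0.05, 0.4, -1.2]
print(ternary_rtn(weights))
```

Each output weight then takes one of only three values per row, which is what makes the aggressive entropy coding of the compressed format possible.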
In general, you can pass `google/switch-large-128` and `google/switch-c-2048` to run on large-128 and c-2048, respectively. Note that SwitchTransformer models other than these three may not work out of the box due to Hugging Face bugs.

Always specify `CUDA_VISIBLE_DEVICES`, since some commands, like `sub1.py`, will otherwise attempt to use all available GPUs.
Our models in compressed custom QMoE format are available on Hugging Face: base-128, large-128 and c-2048. To use them, clone the repository and then simply pass its path to `sub1.py`.
If you found this work useful, please consider citing:
```
@article{frantar-qmoe,
  title={{QMoE}: Practical Sub-1-Bit Compression of Trillion-Parameter Models},
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint, arxiv:2310.16795}
}
```