Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
IST-DASLab/qmoe
This repository contains the full code of the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
It is organized as follows:

- `datautils.py`: utilities for dataset loading
- `gptq.py`: robust batch implementation of GPTQ
- `quant.py`: quantization utilities
- `sub1.py`: efficient inference of compressed models
- `sub1_cuda_kernel.cu`: CUDA kernels
- `switch.py`: the efficient QMoE compression framework
- `test.py`: per-layer benchmarks and ideal compression rates
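As a rough illustration of where "ideal compression rates" come from (this is not the repository's own computation): a ternary weight carries at most log2(3) ≈ 1.58 bits, but when most weights are zero the per-weight entropy drops well below 1 bit, which is what makes sub-1-bit storage possible. A minimal sketch, assuming a hypothetical 90%-sparse ternary distribution:

```python
import math

def ternary_entropy(p_neg: float, p_zero: float, p_pos: float) -> float:
    """Shannon entropy (bits per weight) of a ternary weight distribution."""
    return -sum(p * math.log2(p) for p in (p_neg, p_zero, p_pos) if p > 0)

# Hypothetical, highly sparse ternary distribution: 90% zeros.
bits = ternary_entropy(0.05, 0.90, 0.05)
ideal_rate = 16 / bits  # ideal compression vs. BF16 (16 bits per weight)
print(f"{bits:.3f} bits/weight -> {ideal_rate:.1f}x ideal compression")
```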
The project was developed with:
- `torch==2.0.0+cu117`
- `transformers==4.28.0`
- `datasets==2.10.1`
- CUDA 11.4 GPU drivers
CUDA kernels for compressed storage and inference can be installed via:
```shell
python setup_cuda.py install
```
Below is a list of sample commands for running different experiments:
```shell
# BF16 baseline eval on C4
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128
# BF16 baseline eval on additional datasets
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --detaileval
# ternary round-to-nearest baseline
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --nearest
# ternary compression with QMoE, saving the compressed model for later inference
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --trainsamples 10000 --save PATH_TO_COMP_MODEL
# 2-bit compression with QMoE
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 2 --trainsamples 10000
# test kernels and compute ideal compression rates
CUDA_VISIBLE_DEVICES=0 python test.py
# run per-layer benchmarks
CUDA_VISIBLE_DEVICES=0 python test.py --benchmark
# run eval of stored compressed model
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --valsamples 128
# run end-to-end benchmark
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128
# run simulated end-to-end benchmark for BF16
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128 --simul
```
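The `--wbits 1.5 --nearest` option above corresponds to a ternary round-to-nearest baseline. A minimal NumPy sketch of the general idea (the repository's `quant.py` may choose scales and thresholds differently):

```python
import numpy as np

def ternary_rtn(w: np.ndarray) -> np.ndarray:
    """Quantize each row of w to {-s, 0, +s} by round-to-nearest."""
    s = np.abs(w).max(axis=-1, keepdims=True)  # simple per-row scale
    q = np.round(w / s)                        # integer levels in {-1, 0, 1}
    return q * s                               # dequantized weights

w = np.array([[0.1, -0.9, 0.4, 0.6]])
print(ternary_rtn(w))  # each value snaps to -0.9, 0.0, or 0.9
```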
In general, you can pass `google/switch-large-128` and `google/switch-c-2048` to run on large-128 and c-2048, respectively. Note that SwitchTransformer models other than these three may not work out of the box due to Hugging Face bugs.

Always specify `CUDA_VISIBLE_DEVICES`, since some commands, like `sub1.py`, will otherwise attempt to use all available GPUs.
Our models in compressed custom QMoE format are available on Hugging Face: base-128, large-128 and c-2048. To use them, clone the respective repository and then simply pass its path to `sub1.py`.
If you found this work useful, please consider citing:
```
@article{frantar-qmoe,
  title={{QMoE}: Practical Sub-1-Bit Compression of Trillion-Parameter Models},
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint, arxiv:2310.16795}
}
```