
IST-DASLab/qmoe


This repository contains the full code of the paper QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models.

It is organized as follows:

  • datautils.py: utilities for dataset loading
  • gptq.py: robust batched implementation of GPTQ
  • quant.py: quantization utilities
  • sub1.py: efficient inference of compressed models
  • sub1_cuda_kernel.cu: CUDA kernels
  • switch.py: the efficient QMoE compression framework
  • test.py: per-layer benchmarks and ideal compression rates

Dependencies

The project was developed with:

  • torch==2.0.0+cu117
  • transformers==4.28.0
  • datasets==2.10.1
  • CUDA 11.4 GPU drivers
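
For example, the Python packages can be installed via pip (a sketch; the cu117 index below is the standard PyTorch wheel index and is assumed to still host this version):

pip install torch==2.0.0+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.28.0 datasets==2.10.1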

CUDA kernels for compressed storage and inference can be installed via:

python setup_cuda.py install
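
A quick way to verify the build (a sketch; the module name sub1_cuda_kernel is a guess based on the .cu filename and may differ from what setup_cuda.py actually registers):

import torch
assert torch.cuda.is_available(), "the kernels require a CUDA GPU"

# Hypothetical module name; check setup_cuda.py for the real one.
import sub1_cuda_kernel
print("compressed-storage kernels imported successfully")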

Usage

Below are sample commands for running the different experiments.

# BF16 baseline eval on C4
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128

# BF16 baseline eval on additional datasets
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --detaileval

# ternary round-to-nearest baseline
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --nearest

# ternary compression with QMoE, saving the compressed model for later inference
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --trainsamples 10000 --save PATH_TO_COMP_MODEL

# 2-bit compression with QMoE
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 2 --trainsamples 10000

# test kernels and compute ideal compression rates
CUDA_VISIBLE_DEVICES=0 python test.py

# run per-layer benchmarks
CUDA_VISIBLE_DEVICES=0 python test.py --benchmark

# run eval of stored compressed model
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --valsamples 128

# run end-to-end benchmark
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128

# run simulated end-to-end benchmark for BF16
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128 --simul
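
For intuition about the --wbits 1.5 --nearest baseline above: ternary round-to-nearest maps every weight to the nearest of {-s, 0, +s} for some per-row scale s. A minimal sketch of that rounding step (not the repo's quant.py; the mean-absolute-value scale is an assumption):

import torch

def ternary_rtn(weight: torch.Tensor) -> torch.Tensor:
    # Per-row scale; the mean absolute value is one common, assumed choice.
    scale = weight.abs().mean(dim=1, keepdim=True)
    # Round each entry to the nearest of {-scale, 0, +scale}: entries
    # beyond +-scale/2 snap to +-scale, everything else to zero.
    q = torch.zeros_like(weight)
    q[weight > 0.5 * scale] = 1.0
    q[weight < -0.5 * scale] = -1.0
    return q * scale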

In general, you can pass google/switch-large-128 and google/switch-c-2048 to run on large-128 and c-2048, respectively. Note that SwitchTransformer models other than these three may not work out-of-the-box due to Hugging Face bugs.
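
These checkpoints load through the standard Hugging Face API; for reference, a minimal sketch of fetching the BF16 baseline (plain transformers calls, not code from this repo):

import torch
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

# Any of the three supported checkpoints works here:
# google/switch-base-128, google/switch-large-128, google/switch-c-2048.
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-base-128", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("google/switch-base-128")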

Always specify CUDA_VISIBLE_DEVICES since some commands, like sub1.py, will otherwise attempt to use all available GPUs.

Compressed Models

Our models in compressed custom QMoE format are available on Hugging Face: base-128, large-128 and c-2048. To use them, clone the repository and then simply pass their path to sub1.py.
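
One way to fetch a compressed checkpoint programmatically (a sketch using huggingface_hub; the repository id below is a placeholder, use the actual ids linked above):

from huggingface_hub import snapshot_download

# Placeholder repo id; substitute the actual compressed-model repository.
path = snapshot_download("ORG/COMPRESSED_MODEL_REPO")
print(path)  # pass this path to sub1.py as PATH_TO_COMP_MODEL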

Cite

If you found this work useful, please consider citing:

@article{frantar-qmoe,
  title={{QMoE}: Practical Sub-1-Bit Compression of Trillion-Parameter Models},
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint arXiv:2310.16795}
}
