HALO: Hadamard-Assisted Low-Precision Optimization and Training method for fine-tuning LLMs. 🚀 The official implementation of https://arxiv.org/abs/2501.02625
HALO is a novel quantization-aware training method for fine-tuning Large Language Models (LLMs) with low-precision matrix multiplications. It integrates Hadamard transformations to mitigate outliers, enabling accurate INT8 and FP6 fine-tuning while maintaining compute efficiency. HALO achieves up to 1.41× speedup over full-precision fine-tuning while preserving accuracy, supporting both full and parameter-efficient fine-tuning (PEFT).
HALO is implemented with efficient CUDA kernels and integrates seamlessly with Fully Sharded Data Parallel (FSDP) for low-precision communication, making it ideal for large-scale distributed training. 💡
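To make the role of the Hadamard transform concrete, here is a minimal PyTorch sketch (not HALO's actual kernels; the shapes and the injected outlier value are illustrative assumptions). Rotating both operands of a matmul by an orthonormal Hadamard matrix leaves the product mathematically unchanged while spreading activation outliers, so symmetric INT8 quantization introduces much less error:

```python
# Minimal illustration (not HALO's CUDA kernels): Hadamard rotation before INT8 matmul.
import torch

def hadamard_matrix(n: int) -> torch.Tensor:
    """Orthonormal n x n Hadamard matrix via Sylvester's construction (n a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)  # H @ H.T == I

def int8_quantize(t: torch.Tensor):
    """Symmetric per-tensor quantization to the INT8 range."""
    scale = t.abs().max() / 127
    return torch.round(t / scale).clamp(-127, 127), scale

torch.manual_seed(0)
x = torch.randn(64, 256)
x[0, 0] = 50.0                      # an activation outlier (assumed for illustration)
w = torch.randn(256, 128)
H = hadamard_matrix(256)

# Plain INT8 matmul: the outlier inflates the per-tensor scale.
qx, sx = int8_quantize(x)
qw, sw = int8_quantize(w)
plain = (qx @ qw) * (sx * sw)

# Hadamard-rotated INT8 matmul: (x H)(H^T w) == x w in exact arithmetic,
# but the rotation spreads the outlier, giving a much finer scale.
qxh, sxh = int8_quantize(x @ H)
qwh, swh = int8_quantize(H.T @ w)
rotated = (qxh @ qwh) * (sxh * swh)

ref = x @ w
print("plain INT8 error:  ", (plain - ref).abs().mean().item())
print("rotated INT8 error:", (rotated - ref).abs().mean().item())
```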
Installation 🛠️
First, start by cloning the repository with its submodules:
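For example (the repository URL is left as a placeholder):

```bash
git clone --recursive <repository-url>
```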
Or, if you have already cloned the repository, you can update the submodules with:
```bash
git submodule update --init --recursive
```
Create a new environment (Python 3.10 is tested). Our code currently supports CUDA >= 12.4; older CUDA versions should also work by disabling some of the CUDA kernels.
```bash
# Create an environment, with whatever method you want
conda create -n halo python=3.10
conda activate halo
```
Then run the following commands in order:
```bash
# Install the requirements
source install.sh
```
Training 👨🏫
To fine-tune a Llama-3-8B model, you can run:
```bash
cd scripts
CUDA_VISIBLE_DEVICES=0,1,2,3 bash train_halo.sh DATASET=<dataset> LR=<lr> KERNEL_TYPE=<kernel_type>
```
For `DATASET` and `LR`, you can try the following combinations: (sql, 3e-5), (viggo, 4e-5), (gsm8k, 6e-6). For `KERNEL_TYPE`, you can choose any of the following:
- `base`: runs the baseline BF16 experiment, with HALO disabled.
- `halo0_fp8`: runs HALO level 0 with FP8 precision.
- `halo2_int8`: runs HALO level 2 with INT8 precision.
You can append `_qfsdp` to enable HQ-FSDP, for example: `halo0_fp8_qfsdp`. Other combinations of precision and HALO levels also work, e.g., `halo1_int8_qfsdp`.
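For example, to fine-tune on GSM8K with HALO level 0 in FP8 and HQ-FSDP enabled, using the learning rate suggested above:

```bash
cd scripts
CUDA_VISIBLE_DEVICES=0,1,2,3 bash train_halo.sh DATASET=gsm8k LR=6e-6 KERNEL_TYPE=halo0_fp8_qfsdp
```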
Benchmarks 📊
The benchmark files are located in the `tests` directory:
```bash
cd tests
```
Linear Module
You can run the single layer benchmarks using the following command:
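(The script name below is an assumption for illustration; check the `tests` directory for the actual benchmark file.)

```bash
# Hypothetical script name -- see the tests directory for the actual benchmark file
NCCL_NTHREADS=64 python linear_benchmark.py
```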
Note that `NCCL_NTHREADS=64` is tuned for the RTX 4090. On newer GPUs, you may leave it unset and use the default value.
Citation 📖
If you use HALO in your research, please cite our paper:
```bibtex
@article{halo2025,
  title={HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs},
  author={Saleh Ashkboos and Mahdi Nikdan and Soroush Tabesh and Roberto L. Castro and Torsten Hoefler and Dan Alistarh},
  year={2025},
  eprint={2501.02625},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2501.02625},
}
```