SekecD/BitNet (Public)

Forked from microsoft/BitNet

Official inference framework for 1-bit LLMs

License

MIT license

bitnet.cpp

bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU (with NPU and GPU support coming next).

The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x, with energy reductions between 71.9% and 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. More details will be provided soon.
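As a rough illustration of where the "1.58-bit" figure comes from and why a 100B model can fit on a single machine, the sketch below assumes each weight takes one of three values (-1, 0, +1), as in BitNet b1.58, and computes the information-theoretic storage bound. The function names are hypothetical, and the result is a lower bound on weight storage only; it does not describe the actual file layout used by bitnet.cpp.

```python
import math


def bits_per_ternary_weight() -> float:
    """Information content of one weight that can take three values."""
    return math.log2(3)


def ideal_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Lower bound on weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 100e9  # the 100B model mentioned above
    print(f"bits per ternary weight: {bits_per_ternary_weight():.3f}")
    print(f"FP16 weights:    {ideal_weight_size_gb(n_params, 16):.1f} GB")
    print(f"ternary weights: {ideal_weight_size_gb(n_params, bits_per_ternary_weight()):.1f} GB")
```

Real quantized formats pack weights into whole bits and store extra metadata, so on-disk sizes will be larger than this bound.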

The tested models are dummy setups used in a research context to demonstrate the inference performance of bitnet.cpp.

Demo

A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:

demo.mp4

Timeline

- 10/17/2024 bitnet.cpp 1.0 released.
- 02/27/2024 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- 10/17/2023 BitNet: Scaling 1-bit Transformers for Large Language Models

Supported Models

bitnet.cpp supports a list of 1-bit models available on Hugging Face, which are trained with research settings. We hope the release of bitnet.cpp can inspire more 1-bit LLMs trained in large-scale settings.

Model	Parameters	CPU	I2_S kernel	TL1 kernel	TL2 kernel
bitnet_b1_58-large	0.7B	x86	✔	✘	✔
bitnet_b1_58-large	0.7B	ARM	✔	✔	✘
bitnet_b1_58-3B	3.3B	x86	✘	✘	✔
bitnet_b1_58-3B	3.3B	ARM	✘	✔	✘
Llama3-8B-1.58-100B-tokens	8.0B	x86	✔	✘	✔
Llama3-8B-1.58-100B-tokens	8.0B	ARM	✔	✔	✘
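The table above can also be expressed as a small lookup, which is handy when scripting over several models. This is only a sketch that transcribes the table; the names `SUPPORTED_KERNELS` and `is_supported` are hypothetical and not part of bitnet.cpp.

```python
# Transcription of the "Supported Models" table: (model, CPU) -> supported kernels.
SUPPORTED_KERNELS = {
    ("bitnet_b1_58-large", "x86"): {"I2_S", "TL2"},
    ("bitnet_b1_58-large", "ARM"): {"I2_S", "TL1"},
    ("bitnet_b1_58-3B", "x86"): {"TL2"},
    ("bitnet_b1_58-3B", "ARM"): {"TL1"},
    ("Llama3-8B-1.58-100B-tokens", "x86"): {"I2_S", "TL2"},
    ("Llama3-8B-1.58-100B-tokens", "ARM"): {"I2_S", "TL1"},
}


def is_supported(model: str, cpu: str, kernel: str) -> bool:
    """Return True if the table lists `kernel` for this model/CPU pair."""
    return kernel.upper() in SUPPORTED_KERNELS.get((model, cpu), set())


if __name__ == "__main__":
    print(is_supported("bitnet_b1_58-3B", "ARM", "tl1"))
    print(is_supported("bitnet_b1_58-3B", "x86", "i2_s"))
```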

Installation

Requirements

- python>=3.9
- cmake>=3.22
- clang>=18
  - For Windows users, install Visual Studio 2022. In the installer, toggle on at least the following options (this also automatically installs the required additional tools like CMake):
    - Desktop-development with C++
    - C++-CMake Tools for Windows
    - Git for Windows
    - C++-Clang Compiler for Windows
    - MS-Build Support for LLVM-Toolset (clang)
  - For Debian/Ubuntu users, you can install it with the automatic installation script:
    bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
- conda (highly recommended)
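Before building, it can help to confirm that the installed tools meet the minimum versions listed above. The following is a minimal sketch, not part of the repository: `MINIMUMS`, `parse_version`, `meets_minimum`, and `version_output` are hypothetical helpers, and the parsing assumes the first dotted number in the `--version` output is the tool version, which may not hold for every distribution.

```python
import re
import shutil
import subprocess

# Minimum versions taken from the requirements list above.
MINIMUMS = {"python": (3, 9), "cmake": (3, 22), "clang": (18,)}


def parse_version(text):
    """Extract the first dotted number from `text` as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(tool, version_text):
    """Check a `--version` output line against the documented minimum."""
    version = parse_version(version_text)
    return version is not None and version >= MINIMUMS[tool]


def version_output(tool):
    """Run `<tool> --version` and return its first output line, or None if missing."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    for tool in MINIMUMS:
        line = version_output(tool)
        status = "missing" if line is None else ("ok" if meets_minimum(tool, line) else "too old")
        print(f"{tool}: {status}")
```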

Build from source

Important

If you are using Windows, always run the following commands from a Developer Command Prompt / PowerShell for VS2022.

Clone the repo

git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

Install the dependencies

# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp

pip install -r requirements.txt

Build the project

# Download the model from Hugging Face, convert it to quantized gguf format, and build the project
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

# Or you can manually download the model and run with local path
huggingface-cli download HF1BitLLM/Llama3-8B-1.58-100B-tokens --local-dir models/Llama3-8B-1.58-100B-tokens
python setup_env.py -md models/Llama3-8B-1.58-100B-tokens -q i2_s

usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]
                    [--use-pretuned]

Setup the environment for running inference

optional arguments:
  -h, --help            show this help message and exit
  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens}
                        Model used for inference
  --model-dir MODEL_DIR, -md MODEL_DIR
                        Directory to save/load the model
  --log-dir LOG_DIR, -ld LOG_DIR
                        Directory to save the logging info
  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
                        Quantization type
  --quant-embd          Quantize the embeddings to f16
  --use-pretuned, -p    Use the pretuned kernel parameters
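When setting up several models in a script, the command line can be assembled programmatically. The sketch below only uses flags that appear in the help text above; `setup_env_command` itself is a hypothetical helper, and it builds the argument list without running anything.

```python
# Quantization types accepted according to the help text above.
QUANT_TYPES = ("i2_s", "tl1")


def setup_env_command(hf_repo=None, model_dir=None, quant_type="i2_s",
                      quant_embd=False, use_pretuned=False):
    """Build the argv list for setup_env.py from the documented flags."""
    if quant_type not in QUANT_TYPES:
        raise ValueError(f"quant_type must be one of {QUANT_TYPES}")
    cmd = ["python", "setup_env.py"]
    if hf_repo is not None:
        cmd += ["--hf-repo", hf_repo]
    if model_dir is not None:
        cmd += ["-md", model_dir]
    cmd += ["-q", quant_type]  # always passed explicitly in this sketch
    if quant_embd:
        cmd.append("--quant-embd")
    if use_pretuned:
        cmd.append("--use-pretuned")
    return cmd


if __name__ == "__main__":
    print(" ".join(setup_env_command(hf_repo="HF1BitLLM/Llama3-8B-1.58-100B-tokens")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the dependencies are installed.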

Usage

Basic usage

# Run inference with the quantized model
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:" -n 6 -temp 0

# Output:
# Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?
# Answer: Mary is in the garden.

usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE]

Run inference

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to model file
  -n N_PREDICT, --n-predict N_PREDICT
                        Number of tokens to predict when generating text
  -p PROMPT, --prompt PROMPT
                        Prompt to generate text from
  -t THREADS, --threads THREADS
                        Number of threads to use
  -c CTX_SIZE, --ctx-size CTX_SIZE
                        Size of the prompt context
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Temperature, a hyperparameter that controls the randomness of the generated text

Benchmark

We provide scripts to run the inference benchmark on a given model.

usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]

Setup the environment for running the inference

required arguments:
  -m MODEL, --model MODEL
                        Path to the model file.

optional arguments:
  -h, --help
                        Show this help message and exit.
  -n N_TOKEN, --n-token N_TOKEN
                        Number of generated tokens.
  -p N_PROMPT, --n-prompt N_PROMPT
                        Prompt to generate text from.
  -t THREADS, --threads THREADS
                        Number of threads to use.

Here's a brief explanation of each argument:

- -m, --model: The path to the model file. This is a required argument that must be provided when running the script.
- -n, --n-token: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
- -p, --n-prompt: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
- -t, --threads: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
- -h, --help: Show the help message and exit. Use this argument to display usage information.

For example:

python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4

This command would run the inference benchmark using the model located at /path/to/model, generating 200 tokens from a 256-token prompt, utilizing 4 threads.

For model layouts that are not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:
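To make the documented defaults concrete, here is a minimal `argparse` sketch that mirrors the interface described above (required `-m`, defaults of 128, 512, and 2). It is an illustrative reconstruction from the argument list, not the actual source of `utils/e2e_benchmark.py`; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Parser mirroring the documented e2e_benchmark.py options and defaults."""
    parser = argparse.ArgumentParser(
        prog="e2e_benchmark.py",
        description="Illustrative mirror of the benchmark command-line interface",
    )
    parser.add_argument("-m", "--model", required=True, help="Path to the model file.")
    parser.add_argument("-n", "--n-token", type=int, default=128,
                        help="Number of generated tokens.")
    parser.add_argument("-p", "--n-prompt", type=int, default=512,
                        help="Number of prompt tokens.")
    parser.add_argument("-t", "--threads", type=int, default=2,
                        help="Number of threads to use.")
    return parser


if __name__ == "__main__":
    # Same arguments as the example command above.
    args = build_parser().parse_args(["-m", "/path/to/model", "-n", "200", "-p", "256", "-t", "4"])
    print(args.model, args.n_token, args.n_prompt, args.threads)
```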

python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M

# Run benchmark with the generated model, use -m to specify the model path, -p to specify the prompt processed, -n to specify the number of tokens to generate
python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128

Acknowledgements

This project is based on the llama.cpp framework. We would like to thank all the authors for their contributions to the open-source community. We also thank the T-MAC team for the helpful discussion on the LUT method for low-bit LLM inference.
