turboderp/exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.


A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs.

Disclaimer: The project is coming along, but it's still a work in progress!

Hardware requirements

I am developing on an RTX 4090 and an RTX 3090-Ti. 30-series and later NVIDIA GPUs should be well supported, but anything Pascal or older with poor FP16 support isn't going to perform well. AutoGPTQ or GPTQ-for-LLaMa are better options at the moment for older GPUs. ROCm is also theoretically supported (via HIP) though I currently have no AMD devices to test or optimize on.

Dependencies

  • Python 3.9 or newer
  • torch tested on 2.0.1 and 2.1.0 (nightly) with cu118
  • safetensors 0.3.2
  • sentencepiece
  • ninja

Additionally, only for the web UI:

  • flask
  • waitress

Linux/WSL prerequisites

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118
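
A quick way to confirm that the installed wheel is actually a cu118 build and that the GPU is visible (a minimal check, nothing ExLlama-specific):

import torch

print(torch.__version__)              # e.g. 2.0.1+cu118, or a 2.1.0 nightly tag
print(torch.version.cuda)             # "11.8" for a cu118 wheel
print(torch.cuda.is_available())      # True if the driver and GPU are usable
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 4090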

Windows prerequisites

To run on Windows (without WSL):

  1. Install MSVC 2022. You can choose to install the whole Visual Studio 2022 IDE, or alternatively just the Build Tools for Visual Studio 2022 package (make sure Desktop development with C++ is ticked in the installer); it doesn't really matter which.
  2. Install the appropriate version of PyTorch, choosing one of the CUDA versions. I am developing on the nightly build, but the stable version (2.0.1) should also work.
  3. Install the CUDA Toolkit (11.7 and 11.8 both seem to work, just make sure to match PyTorch's Compute Platform version); a quick check is sketched after this list.
  4. For best performance, enable Hardware Accelerated GPU Scheduling.
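
To confirm that PyTorch can locate the CUDA Toolkit it will build the extension with (the check referenced in step 3), something like this works; the printed path is just an example:

import torch
from torch.utils.cpp_extension import CUDA_HOME

print(CUDA_HOME)           # e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8
print(torch.version.cuda)  # the toolkit version should match this Compute Platform version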

How to

Clone repo, install dependencies, and run benchmark:

git clone https://github.com/turboderp/exllama
cd exllama
pip install -r requirements.txt
python test_benchmark_inference.py -d <path_to_model_files> -p -ppl

The CUDA extension is loaded at runtime so there's no need to install it separately. It will be compiled on the first run and cached to ~/.cache/torch_extensions/, which could take a little while. If nothing happens at first, give it a minute to compile.
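
For reference, the sketch below shows the general mechanism PyTorch provides for runtime-compiled extensions (torch.utils.cpp_extension.load); it is not ExLlama's actual build code, and the extension name and source list are placeholders:

from torch.utils.cpp_extension import load

# PyTorch compiles the listed sources with ninja on first use and caches the
# resulting library under ~/.cache/torch_extensions/, so later runs skip the build.
ext = load(
    name="example_cuda_ext",      # placeholder extension name
    sources=["example_ext.cpp"],  # placeholder source list (.cpp/.cu files)
    verbose=True,                 # print compiler output during the first build
)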

Chatbot example:

python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt

Python module

jllllll currently maintains an installable Python module here which may be more suitable for integrating ExLlama with other projects.
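
If you want to use the code in this repository directly instead, loading and generating looks roughly like the included example scripts; a sketch (paths are placeholders, and class or method names may differ between versions):

import glob
import os

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "/path/to/model/"  # placeholder: directory with config.json, tokenizer.model and .safetensors

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")

config = ExLlamaConfig(model_config_path)  # read model parameters from config.json
config.model_path = glob.glob(st_pattern)  # one or more .safetensors files (see Recent updates)

model = ExLlama(config)                    # load the quantized weights
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)                # key/value cache for generation
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Once upon a time,", max_new_tokens=100))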

Web UI

I also made a simple web UI for it. Don't look at the JavaScript, it was mostly written by ChatGPT and it will haunt your dreams. But it sort of works, and it's kinda fun, especially multibot mode:

_screenshot.jpg

To run it:

pip install -r requirements-web.txt
python webui/app.py -d <path_to_model_files>

Note that sessions are stored in ~/exllama_sessions/ by default. You can change that location with -sd if you want.

Docker

For security benefits and easier deployment, it is also possible to run the web UI in an isolated docker container. Note: the docker image currently only supports NVIDIA GPUs.

Requirements

It is recommended to run docker in rootless mode.

Build

The easiest way to build the docker image is using docker compose. First, set the MODEL_PATH and SESSIONS_PATH variables in the .env file to the actual directories on the host. Then run:

docker compose build

It is also possible to manually build the image:

docker build -t exllama-web .

NOTE: by default, the service inside the docker container is run by a non-root user. Hence, the ownership of bind-mounted directories (/data/model and /data/exllama_sessions in the default docker-compose.yml file) is changed to this non-root user in the container entrypoint (entrypoint.sh). To disable this, set RUN_UID=0 in the .env file if using docker compose, or use the following command if you manually build the image:

docker build -t exllama-web --build-arg RUN_UID=0 .

Run

Using docker compose:

docker compose up

The web UI can now be accessed on the host at http://localhost:5000.

The configuration can be viewed in docker-compose.yml and changed by creating a docker-compose.override.yml file.

Run manually:

docker run --gpus all -p 5000:5000 -v <path_to_model_dir>:/data/model/ -v <path_to_session_dir>:/data/exllama_sessions --rm -it exllama-web --host 0.0.0.0:5000

Results so far

New implementation

| Model | Size | groupsize | act-order | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
|-------|------|-----------|-----------|-----------|------|--------|------|-------|-----|
| Llama | 7B | 128 | no | 2,048 t | 5,194 MB | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
| Llama | 13B | 128 | no | 2,048 t | 9,127 MB | 7,507 t/s | 102 t/s | 86 t/s | 5.60 |
| Llama | 33B | 128 | no | 2,048 t | 20,795 MB | 2,959 t/s | 47 t/s | 40 t/s | 4.60 |
| Llama | 33B | 128 | yes | 2,048 t | 20,795 MB | 2,784 t/s | 45 t/s | 37 t/s | 4.55 |
| Llama | 33B | 32 | yes | 1,550 t ¹ | 21,486 MB | 2,636 t/s | 41 t/s | 37 t/s | 4.52 |
| Koala | 13B | 128 | yes | 2,048 t | 9,127 MB | 5,529 t/s | 93 t/s | 79 t/s | 6.73 |
| WizardLM | 33B | - | yes | 2,048 t | 20,199 MB | 2,313 t/s | 47 t/s | 40 t/s | 5.75 |
| OpenLlama | 3B | 128 | yes | 2,048 t | 3,128 MB | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |

¹ Cannot achieve full sequence length without OoM

All tests done on stock RTX 4090 / 12900K, running with a desktop environment, with a few other apps also using VRAM.

"Prompt" speed is inference over the sequence length listed minus 128 tokens."Worst" is the average speed forthe last 128 tokens of the full context (worst case) and"Best" lists the speed for the first 128 tokens in anempty sequence (best case.)

VRAM usage is as reported by PyTorch and does not include PyTorch's own overhead (CUDA kernels, internal buffers etc.) This is somewhat unpredictable anyway. Best bet is to just optimize VRAM usage by the model, probably aiming for 20 GB on a 24 GB GPU to ensure there is room for a desktop environment and all of Torch's internals.
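
The distinction, in PyTorch terms (which specific statistic the tables use is my assumption):

import torch

torch.cuda.reset_peak_memory_stats()
# ... load the model and run inference here ...
allocated = torch.cuda.max_memory_allocated() / 1024**2  # tensors actually allocated by the model
reserved = torch.cuda.max_memory_reserved() / 1024**2    # includes the caching allocator's pool
print(f"peak allocated: {allocated:,.0f} MB   peak reserved: {reserved:,.0f} MB")
# nvidia-smi will report more still: the CUDA context, kernels and other per-process overhead.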

Perplexity is measured only to verify that the models are working. The dataset used is a particular, small sample from WikiText, so scores are not comparable to other Llama benchmarks and only useful for comparing the different Llama models to one another.
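
For reference, the number itself is just the standard definition, exp of the mean negative log-likelihood of the target tokens; a sketch (not ExLlama's evaluation code):

import math

import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, target_ids: torch.Tensor) -> float:
    # logits:     (seq_len, vocab_size) model outputs for positions 0 .. n-1
    # target_ids: (seq_len,) token ids for positions 1 .. n (shifted by one)
    nll = F.cross_entropy(logits, target_ids, reduction="mean")  # mean negative log-likelihood
    return math.exp(nll.item())                                  # perplexity = exp(mean NLL)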

Dual GPU results

The following benchmarks are from a 4090 + 3090-Ti with -gs 17.2,24:

| Model | Size | groupsize | act-order | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
|-------|------|-----------|-----------|-----------|------|--------|------|-------|-----|
| Llama | 65B | 128 | yes | 2,048 t | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s | 4.20 |
| Llama | 65B | 32 | yes | 2,048 t | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s | 4.11 |
| Llama-2 | 70B | 128 | yes | 2,048 t | 40,680 MB | 914 t/s | 17 t/s | 14 t/s | 4.15 |
| Llama-2 | 70B | 32 | yes | 2,048 t | 36,815 MB | 874 t/s | 15 t/s | 12 t/s | 4.10 |

Note that perplexity scores may not be strictly apples-to-apples between Llama and Llama 2 due to their different pretraining datasets.

Todo

Moved the todo list here.

Compatibility

Here is a list of models confirmed to be working right now.

Recent updates

2023-01-09: Added rope_theta parameter for (at least partial) CodeLlama support. If you were using alpha = 97 or similar, you would no longer need that for CodeLlama models. Still stuff to sort out regarding the extended vocabulary.
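
For context, rope_theta is the base used to compute the rotary position embedding frequencies. CodeLlama ships a much larger base (1e6) in its config than Llama's 10,000, which is roughly what the old alpha workaround was emulating. A sketch of the standard computation (not ExLlama's exact code):

import torch

def rope_inv_freq(head_dim: int, rope_theta: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies: a geometric progression with base rope_theta.
    # A larger base slows the rotation and extends the usable context.
    positions = torch.arange(0, head_dim, 2, dtype=torch.float32)
    return 1.0 / (rope_theta ** (positions / head_dim))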

2023-08-09: Added support for sharded models. config.model_path now accepts either a filename or a list of filenames. model_init() will detect multiple .safetensors files if given a model directory. Note the change in the various examples: model_path = glob.glob(st_pattern)[0] becomes simply model_path = glob.glob(st_pattern). Also there's a little script in util/shard.py to split large .safetensors files. It also produces an index.json file for the sharded model, just for completeness, although ExLlama doesn't need it to read the shards. Note that the safetensors dependency was bumped to version 0.3.2.
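
In code, the change amounts to the following (the directory is a placeholder):

import glob
import os

model_directory = "/path/to/model/"  # placeholder
st_pattern = os.path.join(model_directory, "*.safetensors")

# Before sharded-model support, only a single file was expected:
#   model_path = glob.glob(st_pattern)[0]
# Now the whole list is passed; a single-file model simply yields a one-element list:
model_path = glob.glob(st_pattern)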

2023-08-12: Preliminary, initial and tentative release of ExLlamaV2. It doesn't do all the things that ExLlamaV1 does, yet, but it's better at what it does do. So check it out!
