Text Generation Inference


A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.


Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:

  • Simple launcher to serve the most popular LLMs
  • Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
  • Tensor Parallelism for faster inference on multiple GPUs
  • Token streaming using Server-Sent Events (SSE)
  • Continuous batching of incoming requests for increased total throughput
  • Messages API compatible with the OpenAI Chat Completion API
  • Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
  • Quantization with bitsandbytes, GPTQ, AWQ, Marlin, EETQ, and fp8
  • Safetensors weight loading
  • Watermarking with A Watermark for Large Language Models
  • Logits warper (temperature scaling, top-p, top-k, repetition penalty; see transformers.LogitsProcessor for more details)
  • Stop sequences
  • Log probabilities
  • Speculative decoding for roughly 2x lower latency
  • Guidance/JSON: specify an output format to speed up inference and ensure the output is valid according to a given schema
  • Custom prompt generation: easily generate text by providing custom prompts to guide the model's output
  • Fine-tuning support: use fine-tuned models for specific tasks to achieve higher accuracy and performance
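Of these features, Guidance/JSON is the easiest to picture with a concrete request body. Below is a minimal Python sketch of building a `/generate` payload that constrains output to a JSON schema. The exact shape of the `grammar` parameter is an assumption based on TGI's Guidance feature and may differ across versions, and `build_guided_request` is a hypothetical helper, not part of TGI:

```python
import json

def build_guided_request(prompt: str, schema: dict, max_new_tokens: int = 64) -> dict:
    """Build a /generate payload whose output is constrained to a JSON schema.

    Hypothetical helper; the ``grammar`` parameter shape is an assumption
    based on TGI's Guidance feature and may differ between TGI versions.
    """
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "grammar": {"type": "json", "value": schema},
        },
    }

# Constrain the model to emit an object with a string "answer" field.
schema = {"type": "object", "properties": {"answer": {"type": "string"}}}
payload = build_guided_request("What is Deep Learning?", schema)
print(json.dumps(payload, indent=2))
```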

Hardware support

Get Started

Docker

For a detailed starting guide, please see the Quick Tour. The easiest way to get started is using the official Docker container:

```shell
model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.4 --model-id $model
```

You can then make requests like:

```shell
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
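Because `/generate_stream` replies with Server-Sent Events, a client has to strip the `data:` prefix from each event line before decoding the JSON. A minimal Python sketch, assuming each event carries a `token` object with a `text` field as in TGI's streaming responses; `parse_sse_line` is a hypothetical helper name:

```python
import json

def parse_sse_line(line: str):
    """Extract the token text from one SSE line emitted by /generate_stream.

    Assumes events look like ``data:{"token": {"text": ...}, ...}``;
    returns None for blank or non-data lines (e.g. keep-alives).
    """
    if not line.startswith("data:"):
        return None
    event = json.loads(line[len("data:"):])
    return event["token"]["text"]

# A hand-written sample event, for illustration only:
sample = 'data:{"token": {"id": 3, "text": " learning", "special": false}}'
print(parse_sse_line(sample))
```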

You can also use TGI's Messages API to obtain OpenAI Chat Completion API-compatible responses.

```shell
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```
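The same request can be issued from Python. The sketch below builds an identical body to the curl example, ready to POST as JSON; `build_chat_request` is a hypothetical helper name, not part of TGI:

```python
def build_chat_request(user_message: str,
                       system_message: str = "You are a helpful assistant.",
                       stream: bool = True,
                       max_tokens: int = 20) -> dict:
    """Build the same body as the curl example above, ready to POST
    as JSON to http://localhost:8080/v1/chat/completions."""
    return {
        "model": "tgi",
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "stream": stream,
        "max_tokens": max_tokens,
    }

payload = build_chat_request("What is deep learning?")
print(payload["messages"][1]["content"])
```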

Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the `--gpus all` flag and add `--disable-custom-kernels`. Please note that CPU is not the intended platform for this project, so performance might be subpar.

Note: TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the Supported Hardware documentation. To use AMD GPUs, use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.3.4-rocm --model-id $model` instead of the command above.

To see all options to serve your models (in the code or in the CLI):

```shell
text-generation-launcher --help
```

API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using a private or gated model

You can use the `HF_TOKEN` environment variable to configure the token used by `text-generation-inference`, giving you access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

  1. Go to https://huggingface.co/settings/tokens
  2. Copy your CLI READ token
  3. Export it: `HF_TOKEN=<your CLI READ token>`

or with Docker:

```shell
model=meta-llama/Meta-Llama-3.1-8B-Instruct
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.4 --model-id $model
```

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch for distributed training/inference. `text-generation-inference` makes use of NCCL to enable Tensor Parallelism, which dramatically speeds up inference for large language models.

To share data between the different devices of an NCCL group, NCCL might fall back to using the host memory if peer-to-peer communication using NVLink or PCI is not possible.

To allow the container to use 1G of shared memory and support SHM sharing, we add `--shm-size 1g` to the command above.

If you are running `text-generation-inference` inside Kubernetes, you can also add shared memory to the container by creating a volume with:

```yaml
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.
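Putting both pieces together, a minimal illustrative pod spec fragment might look like the following; the container name is a placeholder and only the `shm` volume and its mount are taken from the instructions above:

```yaml
# Illustrative fragment only; adapt to your deployment.
spec:
  containers:
    - name: text-generation-inference
      image: ghcr.io/huggingface/text-generation-inference:3.3.4
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
```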

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that this will impact performance.

Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address of an OTLP collector with the `--otlp-endpoint` argument. The default service name can be overridden with the `--otlp-service-name` argument.

Architecture

TGI architecture

Detailed blog post by Adyen on TGI's inner workings: LLM inference at scale with TGI (Martin Iglesias Goyanes - Adyen, 2024).

Local install

You can also opt to install `text-generation-inference` locally.

First clone the repository and change directory into it:

```shell
git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference
```

Then install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda or python venv:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# using conda
conda create -n text-generation-inference python=3.11
conda activate text-generation-inference

# using python venv
python3 -m venv .venv
source .venv/bin/activate
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On MacOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

Local install (Nix)

Another option is to install `text-generation-inference` locally using Nix. Currently, we only support Nix on x86_64 Linux with CUDA GPUs. When using Nix, all dependencies can be pulled from a binary cache, removing the need to build them locally.

First follow the instructions to install Cachix and enable the Hugging Face cache. Setting up the cache is important; otherwise Nix will build many of the dependencies locally, which can take hours.

After that you can run TGI with `nix run`:

```shell
cd text-generation-inference
nix run --extra-experimental-features nix-command --extra-experimental-features flakes . -- --model-id meta-llama/Llama-3.1-8B-Instruct
```

Note: when you are using Nix on a non-NixOS system, you have to make some symlinks to make the CUDA driver libraries visible to Nix packages.

For TGI development, you can use the impure dev shell:

```shell
nix develop .#impure

# Only needed the first time the devshell is started or after updating the protobuf.
(
cd server
mkdir text_generation_server/pb || true
python -m grpc_tools.protoc -I../proto/v3 --python_out=text_generation_server/pb \
       --grpc_python_out=text_generation_server/pb --mypy_out=text_generation_server/pb ../proto/v3/generate.proto
find text_generation_server/pb/ -type f -name "*.py" -print0 -exec sed -i -e 's/^\(import.*pb2\)/from . \1/g' {} \;
touch text_generation_server/pb/__init__.py
)
```

All development dependencies (cargo, Python, Torch, etc.) are available in this dev shell.

Optimized architectures

TGI works out of the box to serve optimized models for all modern architectures. They can be found in this list.

Other architectures are supported on a best-effort basis using:

```python
AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")
```

or

```python
AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
```

Run locally

Run

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

Quantization

You can also run pre-quantized weights (AWQ, GPTQ, Marlin) or quantize weights on the fly with bitsandbytes, EETQ, or fp8 to reduce the VRAM requirement:

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize
```

4-bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.

Read more about quantization in the Quantization documentation.

Develop

```shell
make server-dev
make router-dev
```

Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```
