CoderPat/text-generation-inference

Forked from huggingface/text-generation-inference


An Apache 2.0 fork of HuggingFace's Large Language Model Text Generation Inference

huggingface.github.io/text-generation-inference/


LTI's Text Generation Inference Fork

A Rust, Python and gRPC server for text generation inference.

Forked from HuggingFace's Text Generation Inference project (prior to its re-licensing), it is commercial-friendly and licensed under Apache 2.0.

A note on this fork

This fork was created mainly for two reasons:

Primarily, it allows us faster iteration and more flexibility, which is essential for our research uses. It also allows more control over development and documentation, crucial for our in-house uses at CMU.
While we understand the reasons behind the re-licensing, we don't want our (research) contributions to be locked behind a restrictive license. This fork will not sync with the upstream repository, and will be updated independently.

For contributors: If HuggingFace's upstream has a feature that you want to use, please open an issue first and discuss porting the functionality independently. Do not just copy the code over, as it will be rejected.

For LTI/cluster users

Getting started

If you are new to this library and it is already in use in your cluster, we recommend starting with a client-only installation and using models launched by other users.

To start, the TGI_CENTRAL_ADDRESS environment variable needs to be set, so that the client knows which servers to connect to. For example, in the LTI cluster, run

echo "export TGI_CENTRAL_ADDRESS=babel-3-36:8765" >> ~/.bashrc # if using a single machine, use `0.0.0.0:8765` instead
source ~/.bashrc
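As a rough illustration of what the client needs from this variable, the sketch below reads TGI_CENTRAL_ADDRESS and splits it into a host and a port. The helper name `read_central_address` is hypothetical and not part of the client library; it only assumes the `host:port` form shown in the example above.

```python
import os


def read_central_address(env=None):
    """Return (host, port) parsed from TGI_CENTRAL_ADDRESS.

    Hypothetical helper for illustration; it assumes the value has the
    `host:port` form used in the example above.
    """
    env = os.environ if env is None else env
    address = env.get("TGI_CENTRAL_ADDRESS")
    if not address:
        raise RuntimeError("TGI_CENTRAL_ADDRESS is not set")
    # Split on the last colon so the host part is kept intact.
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)


if __name__ == "__main__":
    print(read_central_address({"TGI_CENTRAL_ADDRESS": "babel-3-36:8765"}))
```

Calling `read_central_address()` with no argument reads the real environment, which is a quick way to confirm that the `~/.bashrc` change has taken effect in the current shell.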

To use the Python client, install it with

cd clients/python
pip install .

You can then query the API to list the models available in your cluster, and use models for inference.

from text_generation import Client
# get current models and pick the first one
models = Client.list_from_central()
model_name, model_addr = models[0]["name"], models[0]["address"]
print(f"Using model {model_name} at {model_addr}")
client = Client("http://" + model_addr)
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)
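Picking `models[0]` is fine for a first test, but on a shared cluster you will usually want a specific model. The following is a minimal sketch of selecting an entry by name. `pick_model` is a hypothetical helper, not part of the client, and it assumes only that each entry is a dict with "name" and "address" keys, as in the example above. The sample entries are made up.

```python
def pick_model(models, name_substring):
    """Return the first entry whose name contains `name_substring`.

    `models` is assumed to be a list of dicts with "name" and "address"
    keys, as in the example above. Hypothetical helper, not a client API.
    """
    needle = name_substring.lower()
    for entry in models:
        if needle in entry["name"].lower():
            return entry
    available = ", ".join(entry["name"] for entry in models) or "none"
    raise LookupError(f"no model matching {name_substring!r}; available: {available}")


if __name__ == "__main__":
    # Stand-in data; in practice this would come from Client.list_from_central().
    models = [
        {"name": "example-org/model-a", "address": "node-1:8080"},
        {"name": "example-org/model-b", "address": "node-2:8080"},
    ]
    chosen = pick_model(models, "model-b")
    print("http://" + chosen["address"])
```

The returned address can then be passed to `Client("http://" + address)` exactly as in the example above.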

Updating the environment

In general, you don't have to recreate the environment every time you want to update the library. To update only the library, run the following in the base directory (in a previously created environment):

export DIR=`pwd`
OPENSSL_DIR=${DIR}/.openssl \
OPENSSL_LIB_DIR=${DIR}/.openssl/lib \
OPENSSL_INCLUDE_DIR=${DIR}/.openssl/include \
BUILD_EXTENSIONS=false \
    make install

Running your own servers

If you are an LTI student using one of its clusters (or generally belong to an academic cluster that doesn't have Docker installed), you can sidestep problems with installing system dependencies by using the (mini)conda package manager.

Then, from your base environment, run the install script:

bash setup_scripts/conda_server.sh

Note: This takes a really long time, up to 1.5-3 hours. Sit back and relax while you wait for it.

Note: If you are running in a cluster with module installed, make sure you deactivate all modules before running the script.

This will create a conda environment with all the dependencies needed to run the model servers.

You should then be able to launch models with the text-generation-launcher command, or by using one of the predefined make rules:

conda activate tgi-env
make run-llama2-vicuna-7b

Setting up a Central server

If you are setting up this library for use in your group/cluster for the first time, you will need (or at least benefit from) setting up a central server. See the instructions in the package folder.

Remember to set the TGI_CENTRAL_ADDRESS environment variable (ideally for all the users in your cluster) to the address of the central server.

Chat-UI

It is also possible to use a simple web chat-ui to interact with models running in your server/cluster. This is a simple fork of HuggingFace's Chat UI that communicates with the central controller to get the list of models available in the cluster, and then connects to the corresponding servers to generate text.

For example, in Babel, you can access a running Chat-UI web server with port forwarding by running

ssh babel -L 8888:babel-3-36:4173

and going to localhost:8888 in your browser.

Check the README for more details.
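If the page does not load, it can help to confirm that the forwarded port is actually accepting connections before debugging the browser. The sketch below is a generic TCP check using only the standard library. `port_is_open` is a hypothetical helper, and the default port 8888 simply mirrors the ssh example above.

```python
import socket


def port_is_open(host="localhost", port=8888, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    print("forwarded port reachable:", port_is_open())
```

A `False` result usually means the ssh session with `-L` is not running. A `True` result only shows that something is listening locally, not that the Chat-UI server behind the tunnel is healthy.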

Content below is from the original README.

Features

Serve the most popular Large Language Models with a simple launcher
Tensor Parallelism for faster inference on multiple GPUs
Token streaming using Server-Sent Events (SSE)
Continuous batching of incoming requests for increased total throughput
Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
Quantization with bitsandbytes and GPT-Q
Safetensors weight loading
Watermarking with A Watermark for Large Language Models
Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details see transformers.LogitsProcessor)
Stop sequences
Log probabilities
Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
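To make the "Logits warper" item above more concrete, here is a simplified pure-Python sketch of temperature scaling followed by top-k and top-p filtering. It is an illustration of the general technique, not the server's implementation. The function name `warp_and_sample` is hypothetical, repetition penalty is left out, and real implementations differ in details such as tie handling and operating on batched tensors.

```python
import math
import random


def warp_and_sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token id after temperature, top-k and top-p filtering.

    Illustrative sketch only; `temperature` must be positive.
    """
    # Temperature scaling: divide logits before the softmax.
    scaled = [x / temperature for x in logits]
    # Softmax, subtracting the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose renormalized cumulative
    # probability reaches top_p.
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Sample among the surviving tokens; choices() normalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    print(warp_and_sample([0.0, 5.0, 4.0], temperature=0.7, top_k=2, top_p=0.9))
```

Lower temperatures sharpen the distribution, while smaller `top_k` or `top_p` values shrink the set of candidate tokens. Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.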

Optimized architectures

Other architectures are supported on a best effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")

Get started

Docker

The easiest way of getting started is using the official Docker container:

model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.4 --model-id $model

Note: To use GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

To see all options to serve your models (in the code or in the CLI):

text-generation-launcher --help

You can then query the model using either the /generate or /generate_stream routes:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

or from Python:

pip install text-generation

from text_generation import Client
client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)
text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    if not response.token.special:
        text += response.token.text
print(text)
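If you read the /generate_stream route without the client library, you have to handle the Server-Sent Events framing yourself. The sketch below shows one way to do that with the standard library. It assumes each event arrives as a `data:` line carrying a JSON object with a "token" entry that has "text" and "special" fields, mirroring the attributes used in the client example above. The helper name `iter_stream_text` and the sample events are hypothetical, not captured server output.

```python
import json


def iter_stream_text(lines):
    """Yield the text of each non-special token from SSE lines.

    Assumes `data:`-prefixed JSON events with a "token" object that has
    "text" and "special" fields; this payload shape is an assumption.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        # Skip blank separator lines and any non-data SSE fields.
        if not line.startswith("data:"):
            continue
        event = json.loads(line[len("data:"):])
        token = event["token"]
        if not token["special"]:
            yield token["text"]


if __name__ == "__main__":
    # Hand-written sample events for illustration.
    sample = [
        b'data:{"token": {"id": 1, "text": "Deep", "special": false}}',
        b"",
        b'data:{"token": {"id": 2, "text": " learning", "special": false}}',
        b'data:{"token": {"id": 3, "text": "</s>", "special": true}}',
    ]
    print("".join(iter_stream_text(sample)))
```

The function accepts any iterable of lines, so it can be fed from an HTTP response object that is iterated line by line as well as from a list in a test.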

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using a private or gated model

You can use the HUGGING_FACE_HUB_TOKEN environment variable to configure the token used by text-generation-inference. This gives you access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

Go to https://huggingface.co/settings/tokens
Copy your CLI READ token
Export HUGGING_FACE_HUB_TOKEN=<your cli READ token>

or with Docker:

model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.3 --model-id $model
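When the launcher is started from a Python script rather than a shell, the same token can be passed through the child process environment. The following is a minimal sketch under that assumption. `launcher_invocation` is a hypothetical helper; it only builds the command and environment, using the `--model-id` option and the HUGGING_FACE_HUB_TOKEN variable that appear above.

```python
import os


def launcher_invocation(model_id, token, base_env=None):
    """Return (command, env) for starting the launcher with a Hub token.

    Hypothetical helper: it does not start anything by itself.
    """
    # Copy the environment so the caller's mapping is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Pass the token via the environment instead of hard-coding it in a command.
    env["HUGGING_FACE_HUB_TOKEN"] = token
    command = ["text-generation-launcher", "--model-id", model_id]
    return command, env


if __name__ == "__main__":
    command, env = launcher_invocation("meta-llama/Llama-2-7b-chat-hf", "<your cli READ token>")
    print(" ".join(command))
```

The pair can be handed to `subprocess.run(command, env=env, check=True)` on a machine where text-generation-launcher is installed.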

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an NCCL group, NCCL might fall back to using the host memory if peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g to the above command.

If you are running text-generation-inference inside Kubernetes, you can also add Shared Memory to the container by creating a volume with:

- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi

and mounting it to /dev/shm.

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.

Distributed Tracing

text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address to an OTLP collector with the --otlp-endpoint argument.

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
conda create -n text-generation-inference python=3.9
conda activate text-generation-inference

You may also need to install Protoc.

On Linux:

PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

On macOS, using Homebrew:

brew install protobuf

Then run:

BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
make run-falcon-7b-instruct

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

sudo apt-get install libssl-dev gcc -y

CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can remove the kernels by using the DISABLE_CUSTOM_KERNELS=True environment variable.

Be aware that the official Docker image has them enabled by default.

Run Falcon

Run

make run-falcon-7b-instruct

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

make run-falcon-7b-instruct-quantize
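As a back-of-the-envelope illustration of why quantization lowers the VRAM requirement, the sketch below estimates the memory taken by the weights alone at different precisions. The figures are assumptions for illustration: a nominal 7e9 parameters for a "7B" model, 16-bit weights as the baseline, and 8-bit weights after quantization. The KV cache, activations, and framework overhead are ignored, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    n_params = 7e9  # nominal size of a "7B" model; an assumption, not an exact count
    for label, bits in [("float16", 16), ("int8", 8)]:
        print(f"{label}: {weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions, halving the bits per parameter halves the weight footprint, from roughly 13 GiB to roughly 6.5 GiB for the nominal 7B case.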

Develop

make server-dev
make router-dev

Testing

# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests

Other supported hardware

TGI is also supported on the following AI hardware accelerators:

Habana first-gen Gaudi and Gaudi2: models can be served with TGI on Gaudi and Gaudi2 using Optimum Habana
