alokkulkarni/TensorRT-LLMPublic

forked fromNVIDIA/TensorRT-LLM

NotificationsYou must be signed in to change notification settings
Fork0
Star1

License

Apache-2.0 license

1 star 1.9k forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
3rdparty		3rdparty
benchmarks		benchmarks
cpp		cpp
docker		docker
docs		docs
examples		examples
scripts		scripts
tensorrt_llm		tensorrt_llm
tests		tests
windows		windows
.clang-format		.clang-format
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
requirements-dev-windows.txt		requirements-dev-windows.txt
requirements-dev.txt		requirements-dev.txt
requirements-windows.txt		requirements-windows.txt
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Repository files navigation

TensorRT-LLM

A TensorRT Toolbox for Large Language Models

Architecture | Results | Examples | Documentation

TensorRT-LLM Overview

TensorRT-LLM provides users with an easy-to-use Python API to define LargeLanguage Models (LLMs) and buildTensorRT engines that containstate-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.TensorRT-LLM also contains components to create Python and C++ runtimes thatexecute those TensorRT engines. It also includes abackendfor integration with theNVIDIA Triton Inference Server;a production-quality system to serve LLMs. Models built with TensorRT-LLM canbe executed on a wide range of configurations going from a single GPU tomultiple nodes with multiple GPUs (usingTensor Parallelismand/orPipeline Parallelism).

The Python API of TensorRT-LLM is architectured to look similar to thePyTorch API. It provides users with afunctional module containing functions likeeinsum,softmax,matmul orview. Thelayersmodule bundles useful building blocks to assemble LLMs; like anAttentionblock, aMLP or the entireTransformer layer. Model-specific components,likeGPTAttention orBertAttention, can be found in themodels module.

TensorRT-LLM comes with several popular models pre-defined. They can easily bemodified and extended to fit custom needs. See below for a list of supportedmodels.

To maximize performance and reduce memory footprint, TensorRT-LLM allows themodels to be executed using different quantization modes (seeexamples/gpt for concrete examples). TensorRT-LLM supportsINT4 or INT8 weights (and FP16 activations; a.k.a. INT4/INT8 weight-only) aswell as a complete implementation of theSmoothQuant technique.

For a more detailed presentation of the software architecture and the keyconcepts used in TensorRT-LLM, we recommend you to read the followingdocument.

Installation

For Windows installation, seeWindows/.

TensorRT-LLM must be built from source, instructions can be foundhere. An image of a Docker container withTensorRT-LLM and its Triton Inference Server Backend will be made availablesoon.

The remaining commands in that document must be executed from the TensorRT-LLMcontainer.

Quick Start

To create a TensorRT engine for an existing model, there are 3 steps:

Download pre-trained weights,
Build a fully-optimized engine of the model,
Deploy the engine.

The following sections show how to use TensorRT-LLM to run theBLOOM-560m model.

0. In the BLOOM folder

Inside the Docker container, you have to install the requirements:

pip install -r examples/bloom/requirements.txtgit lfs install

1. Download the model weights from HuggingFace

From the BLOOM example folder, you must download the weights of the model.

cd examples/bloomrm -rf ./bloom/560Mmkdir -p ./bloom/560M&& git clone https://huggingface.co/bigscience/bloom-560m ./bloom/560M

2. Build the engine

# Single GPU on BLOOM 560Mpythonbuild.py--model_dir ./bloom/560M/ \--dtypefloat16 \--use_gemm_pluginfloat16 \--use_gpt_attention_pluginfloat16 \--output_dir ./bloom/560M/trt_engines/fp16/1-gpu/

See the BLOOMexample for more details and options regarding thebuild.py script.

3. Run

Thesummarize.py script can be used to perform the summarization of articlesfrom the CNN Daily dataset:

pythonsummarize.py--test_trt_llm \--hf_model_location ./bloom/560M/ \--data_typefp16 \--engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/

More details about the script and how to run the BLOOM model can be found inthe examplefolder. Many moremodels than BLOOMare implemented in TensorRT-LLM. They can be found in theexamples directory.

Support Matrix

TensorRT-LLM optimizes the performance of a range of well-known models onNVIDIA GPUs. The following sections provide a list of supported GPUarchitectures as well as important features implemented in TensorRT-LLM.

Devices

TensorRT-LLM is rigorously tested on the following GPUs:

H100
L40S
A100/A30
V100 (experimental)

If a GPU is not listed above, it is important to note that TensorRT-LLM isexpected to work on GPUs based on the Volta, Turing, Ampere, Hopper and AdaLovelace architectures. Certain limitations may, however, apply.

Precision

Various numerical precisions are supported in TensorRT-LLM. The support forsome of those numerical features require specific architectures:

	FP32	FP16	BF16	FP8	INT8	INT4
Volta (SM70)	Y	Y	N	N	Y	Y
Turing (SM75)	Y	Y	N	N	Y	Y
Ampere (SM80, SM86)	Y	Y	Y	N	Y	Y
Ada-Lovelace (SM89)	Y	Y	Y	Y	Y	Y
Hopper (SM90)	Y	Y	Y	Y	Y	Y

In this release of TensorRT-LLM, the support for FP8 and quantized data types(INT8 or INT4) is not implemented for all the models. See theprecision document and theexamples folder for additional details.

Key Features

TensorRT-LLM contains examples that implement the following features.

Multi-head Attention(MHA)
Multi-query Attention (MQA)
Group-query Attention(GQA)
In-flight Batching
Paged KV Cache for the Attention
Tensor Parallelism
Pipeline Parallelism
INT4/INT8 Weight-Only Quantization (W4A16 & W8A16)
SmoothQuant
GPTQ
AWQ
FP8
Greedy-search
Beam-search
RoPE

In this release of TensorRT-LLM, some of the features are not enabled for allthe models listed in theexamples folder.

Models

The list of supported models is:

Performance

Please refer to theperformance page forperformance numbers. That page contains measured numbers for four variants ofpopular models (GPT-J, LLAMA-7B, LLAMA-70B, Falcon-180B), measured on the H100,L40S and A100 GPU(s).

Advanced Topics

Quantization

Thisdocument describes the differentquantization methods implemented in TensorRT-LLM and contains a support matrixfor the different models.

In-flight Batching

TensorRT-LLM supports in-flight batching of requests (also known as continuousbatching or iteration-level batching). It's atechnique that aims at reducing waittimes in queues, eliminating the need for padding requests and allowing forhigher GPU utilization.

Attention

TensorRT-LLM implements several variants of the Attention mechanism thatappears in most the Large Language Models. Thisdocument summarizes those implementations andhow they are optimized in TensorRT-LLM.

Graph Rewriting

TensorRT-LLM uses a declarative approach to define neural networks and containstechniques to optimize the underlying graph. For more details, please refer todoc

Benchmark

TensorRT-LLM providesC++ andPython tools to perform benchmarking. Note,however, that it is recommended to use the C++ version.

Troubleshooting

It's recommended to add options–shm-size=1g –ulimit memlock=-1 to thedocker or nvidia-docker run command. Otherwise you may see NCCL errors whenrunning multiple GPU inferences. Seehttps://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#errorsfor details.
When building models, memory-related issues such as

[09/23/2023-03:13:00] [TRT] [E] 9: GPTLMHeadModel/layers/0/attention/qkv/PLUGIN_V2_Gemm_0: could not find any supported formats consistent with input/output data types[09/23/2023-03:13:00] [TRT] [E] 9: [pluginV2Builder.cpp::reportPluginError::24] Error Code 9: Internal Error (GPTLMHeadModel/layers/0/attention/qkv/PLUGIN_V2_Gemm_0: could not find any supported formats consistent with input/output data types)

may happen. One possible solution is to reduce the amount of memory needed byreducing the maximum batch size, input and output lengths. Another option is toenable plugins, for example:--use_gpt_attention_plugin.

Release notes

TensorRT-LLM requires TensorRT 9.1.0.4 and 23.08 containers.

Change Log

TensorRT-LLM v0.5.0 is the first public release.

Known Issues

Report Issues

You can use GitHub issues to report issues with TensorRT-LLM.

About

nvidia.github.io/TensorRT-LLM

Releases

No releases published

Packages

No packages published

Languages

C++98.6%
Python1.0%
Cuda0.4%
CMake0.0%
Smarty0.0%
Shell0.0%

Movatterモバイル変換

License

alokkulkarni/TensorRT-LLM

Folders and files

Latest commit

History

Repository files navigation

TensorRT-LLM

A TensorRT Toolbox for Large Language Models

Table of Contents

TensorRT-LLM Overview

Installation

Quick Start

Support Matrix

Devices

Precision

Key Features

Models

Performance

Advanced Topics

Quantization

In-flight Batching

Attention

Graph Rewriting

Benchmark

Troubleshooting

Release notes

Change Log

Known Issues

Report Issues

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Languages

Packages