Model Definition#
TensorRT-LLM has a Model Definition API that can be used to define Large Language Models. This API is built on top of the powerful TensorRT Python API to create graph representations of deep neural networks in TensorRT. To become familiar with the core concepts of the TensorRT API, refer to the Core Concepts section of the TensorRT documentation before proceeding further.
In TensorRT-LLM, the `tensorrt_llm.Builder` class contains a `tensorrt.Builder` object. That instance is used in the `tensorrt_llm.Builder.create_network` method to create an instance of the `tensorrt.INetworkDefinition` class. The `INetworkDefinition` object can then be populated using the free functions defined in the `tensorrt_llm.functional` module.
A simple example of such a free function is `tensorrt_llm.activation`, which inserts a `tensorrt.IActivationLayer` node in the graph of the model:
```python
# In tensorrt_llm.functional:

def activation(input: Tensor, act_type: trt.ActivationType) -> Tensor:
    layer = default_trtnet().add_activation(input.trt_tensor, act_type)  # default_trtnet() -> INetworkDefinition
    return _create_tensor(layer.get_output(0), layer)
```
To make it even easier for users, a few of the most standard activation functions found in LLMs are derived from that function:
```python
# In tensorrt_llm.functional:

relu = partial(activation, act_type=trt.ActivationType.RELU)
sigmoid = partial(activation, act_type=trt.ActivationType.SIGMOID)
```
Specialized activation functions can be used to assemble more advanced functions such as the `silu` activation:
```python
# In tensorrt_llm.functional:

def silu(input: Tensor) -> Tensor:
    return input * sigmoid(input)
```
When the TensorRT-LLM Model Definition API is used, a graph of the network is assembled. The graph can later be traversed or transformed using the graph traversal API exposed by the `tensorrt.ILayer` class. That graph will also be optimized by TensorRT during the compilation of the engine, as explained in the next section.
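The way free functions populate a graph can be illustrated with a minimal, library-free sketch. It does not use TensorRT or TensorRT-LLM: `Node`, `ToyNetwork`, `mul`, and the string op names are hypothetical stand-ins chosen for illustration, and only the `activation` / `partial` / `silu` structure mirrors the snippets above.

```python
from dataclasses import dataclass, field
from functools import partial
from typing import List


@dataclass
class Node:
    """One layer in the toy graph (hypothetical stand-in for a TensorRT layer)."""
    op: str
    inputs: List["Node"] = field(default_factory=list)


class ToyNetwork:
    """Hypothetical stand-in for a network definition: it only records nodes."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []

    def add(self, op: str, *inputs: Node) -> Node:
        node = Node(op, list(inputs))
        self.nodes.append(node)
        return node


_default_net = ToyNetwork()


def activation(x: Node, act_type: str) -> Node:
    # Free function: inserts one activation node into the default network.
    return _default_net.add(f"activation[{act_type}]", x)


relu = partial(activation, act_type="RELU")
sigmoid = partial(activation, act_type="SIGMOID")


def mul(a: Node, b: Node) -> Node:
    # Element-wise product node (assumed helper for this sketch).
    return _default_net.add("elementwise[PROD]", a, b)


def silu(x: Node) -> Node:
    # Composite function: expands to a sigmoid node followed by a product node.
    return mul(x, sigmoid(x))


x = _default_net.add("input")
y = silu(x)

# Traverse the assembled graph in insertion order.
ops = [node.op for node in _default_net.nodes]
print(ops)
```

Calling `silu` never computes anything; it only records nodes, which is why the resulting graph can be inspected or rewritten before any compilation takes place.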
Compilation#
Once populated, the instance of `tensorrt.INetworkDefinition` can be compiled into an efficient engine by the `tensorrt.Builder`. In TensorRT-LLM, this is done through the `build_engine` member function of the `tensorrt_llm.Builder` class, which calls the [build_serialized_network](https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/python-api/infer/Core/Builder.html#tensorrt.Builder.build_serialized_network) method of the `tensorrt.Builder` object. That call, if everything works as expected, produces an instance of the `tensorrt.IHostMemory` class. That object is an optimized TensorRT engine that can be stored as a binary file.
TensorRT Compiler#
The TensorRT compiler can sweep through the graph to choose the best kernel for each operation and available GPU. Crucially, it can also identify patterns in the graph where multiple operations are good candidates for being fused into a single kernel. This reduces the required amount of memory movement and the overhead of launching multiple GPU kernels.
TensorRT also compiles the graph of operations into a single CUDA Graph that can be launched all at one time, further reducing the kernel launch overhead.
The TensorRT compiler is extremely powerful for fusing layers and increasing execution speed, but some complex layer fusions, such as FlashAttention, involve interleaving many operations together and can't be discovered automatically. For those, you can explicitly replace parts of the graph with plugins at compile time.
Model Engine#
The engine file contains the information that you need for executing the model, but LLM usage in practice requires much more than a single forward pass through the model. TensorRT-LLM includes a highly optimized C++ runtime for executing built LLM engines and managing processes like sampling tokens from the model output, managing the KV cache, and batching requests together.
You can use that runtime directly to execute the model locally, or you can use the TensorRT-LLM runtime backend for NVIDIA Triton Inference Server to serve the model for multiple users.
Weight Bindings#
TensorRT engines embed the network weights, which must be known at compilation time. For that reason, the weights must be bound to parameters in the model definition before calling `tensorrt_llm.Builder.build_engine`. This leads to code like:
```python
# The Linear operator exposes two parameters (see tensorrt_llm/layers/linear.py):
class Linear(Module):
    def __init__(self, ...):
        self.weight = Parameter(shape=(self.out_features, self.in_features), dtype=dtype)
        self.bias = Parameter(shape=(self.out_features, ), dtype=dtype)

# The parameters are bound to the weights before compiling the model. See examples/models/core/gpt/weight.py:
tensorrt_llm_gpt.layers[i].mlp.fc.weight.value = fromfile(...)
tensorrt_llm_gpt.layers[i].mlp.fc.bias.value = fromfile(...)
```
Note that TensorRT can also refit engines to update the weights after compilation. This feature is available to TensorRT-LLM users through the `refit_engine` method in the `tensorrt_llm.Builder` class.
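The bind-before-build constraint can be sketched without any TensorRT dependency. Everything below is hypothetical: `ToyParameter` and `toy_build` are illustrative names, parameters are limited to 2-D nested lists, and the decision to raise an error on unbound parameters is a choice made for this sketch rather than a description of how TensorRT-LLM behaves.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


class ToyParameter:
    """Hypothetical stand-in for a model parameter with a declared 2-D shape."""

    def __init__(self, shape: Tuple[int, int]) -> None:
        self.shape = shape
        self._value: Optional[Matrix] = None

    @property
    def value(self) -> Optional[Matrix]:
        return self._value

    @value.setter
    def value(self, weights: Matrix) -> None:
        # Reject weights whose shape does not match the declaration.
        actual = (len(weights), len(weights[0]))
        if actual != self.shape:
            raise ValueError(f"expected shape {self.shape}, got {actual}")
        self._value = weights


def toy_build(params: Dict[str, ToyParameter]) -> Dict[str, Matrix]:
    """Pretend compilation: embeds the weights, so every parameter must be bound."""
    unbound = [name for name, p in params.items() if p.value is None]
    if unbound:
        raise RuntimeError(f"unbound parameters: {unbound}")
    return {name: p.value for name, p in params.items()}


params = {"fc.weight": ToyParameter(shape=(2, 3))}

try:
    toy_build(params)
    built_without_weights = True
except RuntimeError:
    built_without_weights = False

# Bind the weights, then build.
params["fc.weight"].value = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
engine = toy_build(params)
```

Declaring the shape up front and assigning the value later keeps the model definition independent of where the weights come from, such as a checkpoint file.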
Pattern-Matching and Fusion#
One of the key steps performed by TensorRT when it compiles the network graph is the fusion of operations. Fusion is a well-known technique to improve efficiency when executing LLMs. It helps reduce the amount of data transferred between the memory (DRAM) and the compute cores (CUDA cores as well as Tensor Cores located on the Streaming Multiprocessors of a GPU). It also removes kernel launch overhead (each time a kernel is launched on the GPU, there is a small additional CPU cost that is called the launch overhead). A classical example is the fusion of the activation function with the matrix multiplication (matmul) that usually precedes it in the network.
In TensorRT-LLM, when defining the model, such a sequence can be written as:
```python
c = tensorrt_llm.functional.matmul(a, b)
c = tensorrt_llm.functional.relu(c)
```
During inference, if the above sequence is executed without fusion, the `c` tensor has to be written to global memory at the end of the `matmul`, read from that same memory in `relu`, and written again after `relu`. If no other operation uses the intermediate values between `matmul` and `relu`, that round trip is wasted work. That is why, during compilation, TensorRT will identify that pattern and automatically produce a GPU kernel that applies `relu` at the end of `matmul` without an intermediate step through global memory. With that optimization, the `c` tensor is written only once (after `relu`) instead of twice, and is not read between the two operations.
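The memory-traffic argument can be made concrete with a small counting sketch in plain Python. This is a toy model, not GPU code: `GlobalMemory`, `matmul_then_relu`, and `fused_matmul_relu` are hypothetical names, and only accesses to the output tensor `c` are counted (reads of `a` and `b` are the same in both variants and are ignored).

```python
from typing import List

Matrix = List[List[float]]


class GlobalMemory:
    """Toy model of global memory that counts element reads and writes."""

    def __init__(self) -> None:
        self.reads = 0
        self.writes = 0

    def load(self, buf: Matrix, i: int, j: int) -> float:
        self.reads += 1
        return buf[i][j]

    def store(self, buf: Matrix, i: int, j: int, value: float) -> None:
        self.writes += 1
        buf[i][j] = value


def matmul_then_relu(a: Matrix, b: Matrix, mem: GlobalMemory) -> Matrix:
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # "Kernel" 1: matmul writes every element of c to global memory.
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            mem.store(c, i, j, acc)
    # "Kernel" 2: relu reads every element back and writes it again.
    for i in range(m):
        for j in range(n):
            mem.store(c, i, j, max(mem.load(c, i, j), 0.0))
    return c


def fused_matmul_relu(a: Matrix, b: Matrix, mem: GlobalMemory) -> Matrix:
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # Single "kernel": relu is applied to the accumulator before the only store.
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            mem.store(c, i, j, max(acc, 0.0))
    return c


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
unfused_mem, fused_mem = GlobalMemory(), GlobalMemory()
c_unfused = matmul_then_relu(a, b, unfused_mem)
c_fused = fused_matmul_relu(a, b, fused_mem)

print("unfused:", unfused_mem.writes, "writes,", unfused_mem.reads, "reads")
print("fused:  ", fused_mem.writes, "writes,", fused_mem.reads, "reads")
```

For this 2x2 output, the unfused path performs 8 writes and 4 reads on `c`, while the fused path performs 4 writes and no reads, with identical results.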
The process of identifying the sequences of operations that can be fused is called pattern-matching. TensorRT has a powerful pattern-matching algorithm that can identify a lot of possible fusions. All the identified patterns are converted into more efficient kernels by an advanced kernel compiler.
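As a rough illustration of the idea, pattern-matching can be sketched as a rewrite pass over a list of operations. This is a deliberately simplified toy and says nothing about how TensorRT implements it: the `FUSION_PATTERNS` table, the fused kernel names, and the `fuse` function are hypothetical, and the sketch assumes a linear chain in which each intermediate result has exactly one consumer.

```python
from typing import Dict, List, Tuple

# Hypothetical fusion table: (producer, consumer) -> fused kernel name.
FUSION_PATTERNS: Dict[Tuple[str, ...], str] = {
    ("matmul", "relu"): "matmul_relu",
    ("matmul", "sigmoid"): "matmul_sigmoid",
}


def fuse(ops: List[str]) -> List[str]:
    """Replace every adjacent pair that matches a known pattern with one fused op."""
    fused: List[str] = []
    i = 0
    while i < len(ops):
        pair = tuple(ops[i:i + 2])
        if pair in FUSION_PATTERNS:
            fused.append(FUSION_PATTERNS[pair])
            i += 2  # both ops are consumed by the fused kernel
        else:
            fused.append(ops[i])
            i += 1
    return fused


graph = ["matmul", "relu", "softmax", "matmul", "sigmoid", "matmul"]
optimized = fuse(graph)
print(optimized)
```

A real graph is a DAG rather than a list, so an actual matcher also has to check that the intermediate tensor is not used elsewhere before removing it.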
Plugins#
The number of possible fusions is almost infinite and some useful fusions involve very advanced modifications of the graph. A well-known example is the Flash-Attention technique to optimize the Multihead-Attention block found in many LLMs. Flash-Attention requires modifications to the arithmetic performed in the sequence BMM-Softmax-BMM (where BMM stands for Batched Matrix-Matrix product) and the interleaving of the for-loops of the two batched matrix products. That's non-trivial and not necessarily something you can expect a compiler to "discover" on its own (or it might require the support for a polyhedral model).
As a result, even if TensorRT has a powerful pattern-matching algorithm and supports a lot of possible fusions, there is always the risk that it cannot identify uncommon and/or very advanced patterns. To overcome that inevitable limitation, TensorRT offers a powerful mechanism known as plugins.
The plugins are nodes inserted in the network graph definition that map to user-defined GPU kernels. TensorRT-LLM uses a number of such plugins. They can be found in the `cpp/tensorrt_llm/plugins` directory.
Plugins are written in C++ and follow a well-defined interface described in the Extending TensorRT with Custom Layers section of the TensorRT Developer Guide. When executed within a TensorRT engine, plugins trigger the execution of their encapsulated GPU kernels. A fairly simple example of a plugin is the `QuantizeTensorPlugin`, which triggers a CUDA kernel in the `QuantizeTensorPlugin::enqueue` member function:
```cpp
// In cpp/tensorrt_llm/plugins/quantizeTensorPlugin/quantizeTensorPlugin.cpp:

int QuantizeTensorPlugin::enqueue(...) {
    if (inputDesc[0].type == DataType::kFLOAT) {
        invokeQuantization<float>(...);
    } else {
        invokeQuantization<half>(...);
    }
    return 0;
}

// In cpp/tensorrt_llm/kernels/quantization.cu:

template <typename T>
void invokeQuantization(...) {
    // The standard <<< >>> construct to launch CUDA kernels
    quantizedKernel<<<grid, block, 0, stream>>>(...);
}
```
For more details on how TensorRT-LLM implements the GPT Attention operator, see the Multi-head, Multi-query and Group-query Attention document.
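To see why the Flash-Attention rewrite mentioned earlier in this section is more than a local pattern, consider the following sketch for a single query vector in plain Python. It compares the straightforward BMM-Softmax-BMM order with an interleaved single loop that uses a running maximum and running sum (the "online softmax" trick). This is an illustrative simplification, not the TensorRT-LLM kernel: the function names are hypothetical, the usual scaling of the scores is omitted, and keys are processed one at a time rather than in tiles.

```python
import math
from typing import List

Vector = List[float]


def attention_reference(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    # Step 1 (first matrix product): scores between the query and every key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # Step 2: numerically stable softmax over all scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 3 (second matrix product): weighted sum of the values.
    dim = len(values[0])
    return [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]


def attention_interleaved(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    # One pass over the keys: both products and the softmax are interleaved.
    for k, v in zip(keys, values):
        s = sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, s)
        # Rescale what was accumulated so far to the new maximum.
        # On the first iteration exp(-inf) is 0.0, which zeroes the empty state.
        scale = math.exp(running_max - new_max)
        weight = math.exp(s - new_max)
        running_sum = running_sum * scale + weight
        acc = [a * scale + weight * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ref = attention_reference(q, keys, values)
out = attention_interleaved(q, keys, values)
print(ref, out)
```

The interleaved version never materializes the full vector of scores or probabilities, but getting there requires changing the arithmetic (the rescaling step), which is the kind of transformation a pattern matcher is unlikely to derive by itself.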
Runtime#
TensorRT-LLM includes an API to implement Python and C++ runtimes. The role of the runtime components is to load the TensorRT engines and drive their execution. Typically, for an auto-regressive model like GPT, the runtime is in charge of loading the engine that implements both the processing of the input sequence as well as the body of the generation loop. See the GPT C++ Runtime document for details on the C++ Runtime.
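The split between processing the input sequence and running the generation loop can be sketched as follows. This is not the TensorRT-LLM runtime API: `toy_engine` is a hypothetical, deterministic stand-in for an engine call, the "KV cache" is just a list of token ids, and token selection is a fixed formula rather than real sampling. The sketch assumes `max_output_len` is at least 1.

```python
from typing import List


def toy_engine(new_tokens: List[int], kv_cache: List[int]) -> int:
    """Hypothetical engine step: consume new tokens, extend the cache, return the next token."""
    kv_cache.extend(new_tokens)
    return (sum(kv_cache) + 1) % 50


def generate(prompt: List[int], max_output_len: int, end_id: int) -> List[int]:
    kv_cache: List[int] = []
    # Context phase: the whole input sequence is processed in one call.
    next_token = toy_engine(prompt, kv_cache)
    output = [next_token]
    # Generation phase: one token per step, reusing the cache built so far.
    while len(output) < max_output_len and next_token != end_id:
        next_token = toy_engine([next_token], kv_cache)
        output.append(next_token)
    return output


tokens = generate(prompt=[3, 5, 7], max_output_len=4, end_id=0)
print(tokens)
```

The point of the structure is that the loop body only ever feeds one new token to the engine; everything computed for earlier tokens is kept in the cache instead of being recomputed.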
Multi-GPU and Multi-Node Support#
Even if TensorRT is designed for single-GPU systems, TensorRT-LLM adds support for systems with multiple GPUs and nodes. It is enabled using TensorRT plugins that wrap communication primitives from the NCCL library as well as a custom plugin that optimizes the All-Reduce primitive in the presence of All-to-all connections between GPUs (through NVSwitch in DGX systems).
The communication plugins can be found in `cpp/tensorrt_llm/plugins/ncclPlugin` and the multi-GPU functions are exposed in the TensorRT-LLM Model Definition API as:
```python
# In tensorrt_llm/functional.py:

# Collectives.
def allreduce(tensor: Tensor, group: List[int]) -> Tensor
def allgather(tensor: Tensor, group: List[int], gather_dim: int = 0) -> Tensor

# Point-to-point communication primitives.
def send(tensor: Tensor, tgt: int) -> Tensor
def recv(tensor: Tensor, src: int) -> Tensor
```
The multi-GPU support can be enabled through two different modes of model parallelism: Tensor Parallelism and Pipeline Parallelism. Tensor Parallelism splits each layer of a model across the GPUs. Each GPU runs the entire network on its slice of the weights and synchronizes with its siblings when needed. Pipeline Parallelism distributes the different layers to the GPUs. Each GPU runs a subset of the entire model and communications happen at the boundary of those subsets of layers. Tensor Parallelism usually leads to more balanced executions but requires more bandwidth between the GPUs. Pipeline Parallelism reduces the need for high-bandwidth communication but may incur load-balancing issues and may be less efficient in terms of GPU utilization.
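The difference between the two modes can be sketched on a single process with plain Python lists. Nothing here uses NCCL or TensorRT-LLM: the "GPUs" are loop iterations, the gather step is a list concatenation, and all function names are hypothetical. The tensor-parallel variant shown is a column split of one weight matrix, which is one common way to shard a linear layer, and the sketch assumes the column count and layer count divide evenly.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    # Assumes the number of columns is divisible by `parts`.
    step = len(w[0]) // parts
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(parts)]


def tensor_parallel_linear(x: Matrix, w: Matrix, tp_size: int) -> Matrix:
    # Every rank holds one column slice of the same layer and sees the full input.
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, tp_size)]
    # Stand-in for an all-gather along the column dimension.
    return [sum((p[i] for p in partial_outputs), []) for i in range(len(x))]


def pipeline_parallel_forward(x: Matrix, layers: List[Matrix], pp_size: int) -> Matrix:
    # Every rank holds a contiguous group of whole layers.
    per_stage = len(layers) // pp_size
    for stage in range(pp_size):
        for w in layers[stage * per_stage:(stage + 1) * per_stage]:
            x = matmul(x, w)
        # In a real system, x would be sent to the next stage here.
    return x


x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
tp_out = tensor_parallel_linear(x, w, tp_size=2)

layers = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]]]
pp_out = pipeline_parallel_forward(x, layers, pp_size=2)
print(tp_out, pp_out)
```

In the tensor-parallel case every rank participates in every layer and must communicate once per layer, whereas in the pipeline case communication happens only when the activation crosses a stage boundary.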
Examples#
Here are examples of Llama 3.1 70B and Llama 3.1 405B showing how to perform multi-GPU and multi-node inference in TensorRT-LLM. The example of Llama 3.1 70B performs multi-GPU inference on a single node, while the example of Llama 3.1 405B performs multi-node inference.
Llama 3.1 70B#
The following sample commands build an engine for running the Llama 3.1 70B model with tensor parallelism (TP=4) using 4 GPUs on a single node.
```bash
folder_trt_llm=../TensorRT-LLM
model_dir=Llama-3.1-70B
ckpt_dir=ckpt_llama_3.1_70b
engine_dir=engine_llama_3.1_70b
dtype=bfloat16
tp_size=4
pp_size=1
kv_cache_type=paged
max_input_len=128
max_output_len=128
max_batch_size=4
workers=$(( tp_size * pp_size ))

python ${folder_trt_llm}/examples/models/core/llama/convert_checkpoint.py \
    --output_dir ${ckpt_dir} \
    --model_dir ${model_dir} \
    --dtype ${dtype} \
    --tp_size ${tp_size} \
    --pp_size ${pp_size} \
    --workers ${workers} \
    --use_parallel_embedding

trtllm-build \
    --output_dir ${engine_dir} \
    --checkpoint_dir ${ckpt_dir} \
    --gemm_plugin ${dtype} \
    --gpt_attention_plugin ${dtype} \
    --kv_cache_type ${kv_cache_type} \
    --max_input_len ${max_input_len} \
    --max_seq_len $(( max_input_len + max_output_len )) \
    --max_batch_size ${max_batch_size} \
    --workers ${workers}
```
The following sample commands perform inference using 4 GPUs on a single node by running `examples/run.py`.
```bash
input_text="Born in north-east France, Soyer trained as a"

mpirun -n $(( tp_size * pp_size )) \
    python ${folder_trt_llm}/examples/run.py \
        --engine_dir ${engine_dir} \
        --tokenizer_dir ${model_dir} \
        --input_text "${input_text}" \
        --max_output_len ${max_output_len}
```
Llama 3.1 405B#
The following sample commands build an engine for running the Llama 3.1 405B model with tensor parallelism (TP=16) on 2 nodes that each have 8 GPUs. Although the model runs on multiple nodes, you can build the engine on a single node.
```bash
folder_trt_llm=../TensorRT-LLM
model_dir=Llama-3.1-405B
ckpt_dir=ckpt_llama_3.1_405b
engine_dir=engine_llama_3.1_405b
dtype=bfloat16
tp_size=16
pp_size=1
kv_cache_type=paged
max_input_len=128
max_output_len=128
max_batch_size=4
workers=8

python ${folder_trt_llm}/examples/models/core/llama/convert_checkpoint.py \
    --output_dir ${ckpt_dir} \
    --model_dir ${model_dir} \
    --dtype ${dtype} \
    --tp_size ${tp_size} \
    --pp_size ${pp_size} \
    --workers ${workers} \
    --use_parallel_embedding

trtllm-build \
    --output_dir ${engine_dir} \
    --checkpoint_dir ${ckpt_dir} \
    --gemm_plugin ${dtype} \
    --gpt_attention_plugin ${dtype} \
    --kv_cache_type ${kv_cache_type} \
    --max_input_len ${max_input_len} \
    --max_seq_len $(( max_input_len + max_output_len )) \
    --max_batch_size ${max_batch_size} \
    --workers ${workers}
```
The following sample script, `launch_llama_3.1_405b.sh`, shows how to perform inference with Slurm on 2 nodes that each have 8 GPUs. If you use a different workload manager, the key requirement is to run the `examples/run.py` command on every rank.
```bash
#!/bin/bash
#SBATCH --account account
#SBATCH --partition partition
#SBATCH --job-name job-name
#SBATCH --time 1:00:00
#SBATCH --nodes 2

folder_trt_llm=../TensorRT-LLM
engine_dir=engine_llama_3.1_405b
model_dir=Llama-3.1-405B
max_output_len=128

input_text="Born in north-east France, Soyer trained as a"

srun \
    --ntasks-per-node 8 \
    --mpi pmix \
    python ${folder_trt_llm}/examples/run.py \
        --engine_dir ${engine_dir} \
        --tokenizer_dir ${model_dir} \
        --input_text "${input_text}" \
        --max_output_len ${max_output_len}
```
You can perform inference by running the script on the Slurm cluster.
```bash
sbatch launch_llama_3.1_405b.sh
```