LLM Common Customizations#
Quantization#
TensorRT LLM can quantize the Hugging Face model automatically by setting the appropriate flags in the LLM instance. For example, the following code performs an Int4 AWQ quantization. Please refer to the complete list of supported flags and acceptable values.
```python
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)

llm = LLM(<model-dir>, quant_config=quant_config)
```
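As intuition for what W4A16 weight quantization does, here is a framework-independent toy sketch of symmetric 4-bit quantization. This is illustrative only: AWQ additionally chooses activation-aware scales that protect salient weights, and `quantize_int4`/`dequantize` are invented names, not TensorRT LLM APIs.

```python
def quantize_int4(weights):
    """Map float weights onto the signed 4-bit integer range [-8, 7]
    with a single shared scale (assumes a nonzero max weight)."""
    scale = max(abs(w) for w in weights) / 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int4 values."""
    return [v * scale for v in q]

w = [0.12, -0.7, 0.35, 0.02]
q, scale = quantize_int4(w)
approx = dequantize(q, scale)
print(q)       # each entry fits in 4 bits
print(approx)  # close to, but not exactly, the original weights
```

The "A16" half of W4A16 means activations stay in 16-bit floating point; only the weights are stored in this compressed form and dequantized on the fly.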
Sampling#
SamplingParams can customize the sampling strategy to control LLM generated responses, such as beam search, temperature, and others.

As an example, to enable beam search with a beam size of 4, set the sampling_params as follows:
```python
from tensorrt_llm.llmapi import LLM, SamplingParams, BuildConfig

build_config = BuildConfig()
build_config.max_beam_width = 4

llm = LLM(<llama_model_path>, build_config=build_config)

# Let the LLM object generate text with the default sampling strategy, or
# you can create a SamplingParams object as well with several fields set manually
sampling_params = SamplingParams(beam_width=4)  # current limitation: beam_width should be equal to max_beam_width

for output in llm.generate(<prompt>, sampling_params=sampling_params):
    print(output)
```
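To build intuition for what the beam width controls, here is a toy, framework-independent beam search sketch. The per-step token table is invented for illustration; a real model produces these log-probabilities.

```python
import math

def beam_search(step_logprobs, beam_width):
    """Keep the `beam_width` highest-scoring token sequences at every step.

    `step_logprobs` is a list of dicts mapping token -> log-probability,
    one dict per generation step (a stand-in for a model's output).
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for logprobs in step_logprobs:
        # Extend every surviving hypothesis with every candidate token
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in logprobs.items()
        ]
        # Prune back down to the top `beam_width` hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

steps = [
    {"the": math.log(0.5), "a": math.log(0.5)},
    {"cat": math.log(0.7), "dog": math.log(0.3)},
]
for seq, score in beam_search(steps, beam_width=2):
    print(seq, round(score, 3))
```

A beam width of 1 degenerates to greedy decoding; larger widths trade compute and memory (hence the max_beam_width build-time limit) for a better chance of finding high-probability sequences.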
SamplingParams manages and dispatches fields to the underlying C++ configuration classes. Refer to the class documentation for more details.
Build Configuration#
Apart from the arguments mentioned above, you can also customize the build configuration with the build_config argument and other arguments borrowed from the trtllm-build CLI. These build configuration options provide flexibility in building engines for the target hardware and use cases. Refer to the following example:
```python
llm = LLM(
    <model-path>,
    build_config=BuildConfig(
        max_num_tokens=4096,
        max_batch_size=128,
        max_beam_width=4,
    ),
)
```
Refer to the BuildConfig documentation for more details.
Runtime Customization#
Similar to build_config, you can also customize the runtime configuration with the runtime_config, peft_cache_config, or other arguments borrowed from the Executor API. These runtime configuration options provide additional flexibility with respect to KV cache management, GPU memory allocation, and so on. Refer to the following example:
```python
from tensorrt_llm.llmapi import LLM, KvCacheConfig

llm = LLM(
    <llama_model_path>,
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),
)
```
Tokenizer Customization#
By default, the LLM API uses transformers' AutoTokenizer. You can override it with your own tokenizer by passing it when creating the LLM object. Refer to the following example:
```python
llm = LLM(<llama_model_path>, tokenizer=<my_faster_one>)
```
The LLM() workflow will then use your tokenizer instead.
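As a sketch of what a drop-in replacement might look like, here is a deliberately minimal whitespace tokenizer. This is a hypothetical illustration: a real replacement should mirror the Hugging Face tokenizer interface that AutoTokenizer provides (encode/decode plus special-token handling), and the class and vocabulary here are invented.

```python
class WhitespaceTokenizer:
    """Toy tokenizer: one token per whitespace-separated word."""

    def __init__(self, vocab):
        # Map each known word to an integer ID, and back
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.inverse = {i: tok for tok, i in self.vocab.items()}

    def encode(self, text):
        return [self.vocab[tok] for tok in text.split()]

    def decode(self, token_ids):
        return " ".join(self.inverse[i] for i in token_ids)

tokenizer = WhitespaceTokenizer(["hello", "world"])
ids = tokenizer.encode("hello world")
print(ids)                     # [0, 1]
print(tokenizer.decode(ids))   # hello world
```

A custom tokenizer is worth the effort only when its encode/decode round-trip is exactly compatible with the vocabulary the model was trained on.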
It is also possible to input token IDs directly without tokenizers with the following code. Note that the output will contain token IDs but no text, because the tokenizer is not used.
```python
llm = LLM(<llama_model_path>)

for output in llm.generate([32, 12]):
    ...
```
Disable Tokenizer#
For performance considerations, you can disable the tokenizer by passing skip_tokenizer_init=True when creating LLM. In this case, LLM.generate and LLM.generate_async will expect prompt token IDs as input. Refer to the following example:
```python
llm = LLM(<llama_model_path>, skip_tokenizer_init=True)

for output in llm.generate([[32, 12]]):
    print(output)
```
You will get something like:
```python
RequestOutput(request_id=1, prompt=None, prompt_token_ids=[1, 15043, 29892, 590, 1024, 338], outputs=[CompletionOutput(index=0, text='', token_ids=[518, 10858, 4408, 29962, 322, 306, 626, 263, 518, 10858, 20627, 29962, 472, 518, 10858, 6938, 1822, 306, 626, 5007, 304, 4653, 590, 4066, 297, 278, 518, 11947, 18527, 29962, 2602, 472], cumulative_logprob=None, logprobs=[])], finished=True)
```
Note that the text field in CompletionOutput is empty since the tokenizer is deactivated.
Generation#
Asyncio-Based Generation#
With the LLM API, you can also perform asynchronous generation with the generate_async method. Refer to the following example:
```python
llm = LLM(model=<llama_model_path>)

async for output in llm.generate_async(<prompt>, streaming=True):
    print(output)
```
When the streaming flag is set to True, the generate_async method returns a generator that yields each token as soon as it becomes available. Otherwise, it returns a generator that waits for and yields only the final result.
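The streaming behavior can be mimicked with a plain asyncio async generator. This toy stand-in (not the real TensorRT LLM implementation; `toy_generate_async` is an invented name) yields pieces one at a time so a consumer can handle them as they arrive:

```python
import asyncio

async def toy_generate_async(prompt, streaming):
    """Stand-in for a streaming generation API: yields one piece at a
    time when streaming, or a single joined string otherwise."""
    tokens = ["Hello", ",", " world"]
    if streaming:
        for tok in tokens:
            await asyncio.sleep(0)  # yield control so other tasks can run
            yield tok
    else:
        yield "".join(tokens)

async def main():
    # Consume the stream piece by piece, exactly as `async for` does above
    return [tok async for tok in toy_generate_async("hi", streaming=True)]

print(asyncio.run(main()))  # ['Hello', ',', ' world']
```

The point of the pattern is latency: the first token reaches the caller before the rest have been generated, instead of arriving all at once at the end.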
Future-Style Generation#
The result of the generate_async method is a Future-like object; it doesn't block the thread unless .result() is called.
```python
# This will not block the main thread
generation = llm.generate_async(<prompt>)

# Do something else here

# Call .result() to explicitly block the main thread and wait for the result when needed
output = generation.result()
```
The .result() method works like the result method of a Python Future; you can specify a timeout to wait for the result.
```python
output = generation.result(timeout=10)
```
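The same submit-now, block-later pattern appears in Python's standard library. This sketch uses concurrent.futures with a plain thread pool (`slow_generation` is an invented stand-in, not an LLM API call) to show how a Future lets the caller keep working until the result is actually needed:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_generation(prompt):
    """Stand-in for a long-running generation call."""
    time.sleep(0.1)
    return f"response to {prompt!r}"

with ThreadPoolExecutor() as pool:
    # submit() returns immediately with a Future; the work runs in the background
    future = pool.submit(slow_generation, "hello")

    # Do something else here...

    # .result() blocks until the value is ready, or raises TimeoutError
    # if the optional timeout elapses first
    output = future.result(timeout=5)

print(output)  # response to 'hello'
```

As with the LLM API's Future-like object, exceptions raised during the background work surface at the .result() call, not at submission time.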
There is an async version, where .aresult() is used:
```python
generation = llm.generate_async(<prompt>)
output = await generation.aresult()
```