Running LLMs with C++#
This guide explains how to use ExecuTorch’s C++ runner library to run LLMs that have been exported to the `.pte` format. The runner library provides a high-level API for text generation with LLMs, handling tokenization, inference, and token generation.
Prerequisites#
Before you begin, make sure you have:
- A model exported to `.pte` format using the `export_llm` API, as described in Exporting popular LLMs out of the box or Exporting custom LLMs. Please also see the Model Metadata section for important metadata to be serialized into the `.pte`.
- A tokenizer file compatible with your model
  - For HuggingFace tokenizers, this is a JSON file, `tokenizer.json`
  - For SentencePiece tokenizers, this is a `tokenizer.model` file that normally lives alongside the weights file
- CMake and a C++ compiler installed
  - CMake version 3.29 or higher
  - g++ or clang compiler
Model Metadata#
The metadata includes several important configuration parameters that should be baked in during the export step and are consumed by the runner library (a sketch of reading them back at runtime follows the list):
- `enable_dynamic_shape`: Whether the model supports dynamic input shapes
- `max_seq_len`: Maximum sequence length the model can handle
- `max_context_len`: Maximum context length for the KV cache
- `use_kv_cache`: Whether the model uses a KV cache for efficient generation
- `get_bos_id`: Beginning-of-sequence token ID
- `get_eos_ids`: End-of-sequence token IDs
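These values are read back when the model is loaded. As a rough illustration, the minimal sketch below assumes the metadata keys are exported as callable constant methods on the program, which `Module::execute` can invoke by name; the file path is a placeholder:

```cpp
// Minimal sketch: reading exported metadata back from a .pte file.
// Assumes the keys above were serialized as constant methods during export.
#include <executorch/extension/module/module.h>

#include <iostream>

int main() {
  executorch::extension::Module module("path/to/model.pte");

  // Each metadata entry can be executed like a tiny method returning a constant.
  auto bos = module.execute("get_bos_id");
  auto max_ctx = module.execute("get_max_context_len");

  if (bos.ok() && max_ctx.ok()) {
    std::cout << "BOS id: " << bos->at(0).toInt() << "\n"
              << "Max context length: " << max_ctx->at(0).toInt() << std::endl;
  }
  return 0;
}
```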
Adding Metadata During Export#
To ensure your model has the necessary metadata, you can specify it during export using the `metadata` parameter in the export configuration:
```bash
# export_llm
python -m extension.llm.export.export_llm \
  --config path/to/config.yaml \
  +base.metadata='{"get_bos_id":128000, "get_eos_ids":[128009, 128001], "get_max_context_len":4096}'
```
Building the Runner Library#
The ExecuTorch LLM runner library can be built using CMake. To integrate it into your project:
1. Add ExecuTorch as a dependency in your CMake project
2. Enable the required components (`extension_module`, `extension_tensor`, etc.)
3. Link your application against the `extension_llm_runner` library
Here’s a simplified example of the CMake configuration:
```cmake
# Enable required components
set_overridable_option(EXECUTORCH_BUILD_EXTENSION_MODULE ON)
set_overridable_option(EXECUTORCH_BUILD_EXTENSION_TENSOR ON)
set_overridable_option(EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER ON)

# Add ExecuTorch as a dependency
add_subdirectory(executorch)

# Link against the LLM runner library
target_link_libraries(your_app PRIVATE extension_llm_runner)
```
Building the Llama Runner#
ExecuTorch provides a complete example of a C++ runner for Llama models in the `examples/models/llama` directory. This runner demonstrates how to use the LLM runner library to run Llama models exported to the `.pte` format.
Please note that this runner library is not limited to Llama models; it can be used with any text-only, decoder-only LLM that has been exported to `.pte`.
Basic Usage Example#
Here’s a simplified example of using the runner:
```cpp
#include <executorch/extension/llm/runner/text_llm_runner.h>

#include <iostream>
#include <string>
#include <utility>

using namespace executorch::extension::llm;

int main() {
  // Load tokenizer and create runner
  auto tokenizer = load_tokenizer("path/to/tokenizer.json", nullptr, std::nullopt, 0, 0);
  auto runner = create_text_llm_runner("path/to/model.pte", std::move(tokenizer));

  // Load the model
  runner->load();

  // Configure generation
  GenerationConfig config;
  config.max_new_tokens = 100;
  config.temperature = 0.8f;

  // Generate text with streaming output
  runner->generate(
      "Hello, world!",
      config,
      [](const std::string& token) { std::cout << token << std::flush; },
      nullptr);

  return 0;
}
```
The Runner API Architecture#
The ExecuTorch LLM runner library is designed with a modular architecture that separates concerns between different components of the text generation pipeline.
IRunner Interface#
The `IRunner` interface (`irunner.h`) defines the core functionality for LLM text generation. This interface serves as the primary abstraction for interacting with LLM models:
```cpp
class IRunner {
 public:
  virtual ~IRunner() = default;
  virtual bool is_loaded() const = 0;
  virtual runtime::Error load() = 0;
  virtual runtime::Error generate(...) = 0;
  virtual runtime::Error generate_from_pos(...) = 0;
  virtual void stop() = 0;
};
```
Let’s examine each method in detail:
`bool is_loaded() const`
Checks if the model and all necessary resources have been loaded into memory and are ready for inference. This method is useful for verifying the runner’s state before attempting to generate text.
`runtime::Error load()`
Loads the model and prepares it for inference. This includes:
- Loading the model weights from the `.pte` file
- Initializing any necessary buffers or caches
- Preparing the execution environment
This method should be called before any generation attempts. It returns an `Error` object indicating success or failure.
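For example, a small helper might check `is_loaded()` before triggering a load and propagate any failure to the caller. This is just a sketch; `ensure_loaded` is a hypothetical helper, not part of the library:

```cpp
// Sketch: load the model once and surface failures to the caller.
#include <executorch/extension/llm/runner/irunner.h>

using executorch::extension::llm::IRunner;

executorch::runtime::Error ensure_loaded(IRunner& runner) {
  if (runner.is_loaded()) {
    return executorch::runtime::Error::Ok;  // Already loaded, nothing to do.
  }
  // load() reads the .pte, initializes caches, and prepares execution.
  return runner.load();
}
```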
```cpp
runtime::Error generate(
    const std::string& prompt,
    const GenerationConfig& config,
    std::function<void(const std::string&)> token_callback,
    std::function<void(const Stats&)> stats_callback)
```
The primary method for text generation. It takes:
- `prompt`: The input text to generate from
- `config`: Configuration parameters controlling the generation process
- `token_callback`: A callback function that receives each generated token as a string
- `stats_callback`: A callback function that receives performance statistics after generation completes
The token callback is called for each token as it’s generated, allowing for streaming output. The stats callback provides detailed performance metrics after generation completes.
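Putting the two callbacks together might look like the sketch below. `run_prompt` is a hypothetical helper, and the `Stats` field name `num_generated_tokens` is an assumption for illustration:

```cpp
// Sketch: stream tokens into a string and report a basic stat afterwards.
#include <executorch/extension/llm/runner/text_llm_runner.h>

#include <iostream>
#include <string>

using namespace executorch::extension::llm;

void run_prompt(IRunner& runner, const std::string& prompt) {
  GenerationConfig config;
  config.max_new_tokens = 64;

  std::string output;
  runner.generate(
      prompt,
      config,
      // Token callback: invoked once per generated token, enabling streaming.
      [&output](const std::string& token) { output += token; },
      // Stats callback: invoked once after generation completes.
      [](const Stats& stats) {
        // `num_generated_tokens` is assumed here for illustration.
        std::cout << "\nGenerated " << stats.num_generated_tokens << " tokens" << std::endl;
      });

  std::cout << output << std::endl;
}
```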
```cpp
runtime::Error generate_from_pos(
    const std::string& prompt,
    int64_t start_pos,
    const GenerationConfig& config,
    std::function<void(const std::string&)> token_callback,
    std::function<void(const Stats&)> stats_callback)
```
An advanced version of `generate()` that allows starting generation from a specific position in the KV cache. This is useful for continuing generation from a previous state.
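For example, a multi-turn flow might record how many tokens are already in the KV cache after the first call and pass that as `start_pos` for the follow-up. This is a hedged sketch: `two_turn_chat` is hypothetical, and deriving the position from the `Stats` fields shown is an assumption for illustration:

```cpp
// Hedged sketch: continue generation on top of an existing KV cache.
#include <executorch/extension/llm/runner/text_llm_runner.h>

#include <iostream>
#include <string>

using namespace executorch::extension::llm;

void two_turn_chat(IRunner& runner) {
  GenerationConfig config;
  config.max_new_tokens = 64;

  // Track how far the KV cache has been filled. The Stats fields used to
  // compute this are assumptions for illustration.
  int64_t tokens_in_cache = 0;

  runner.generate(
      "Tell me a short story.",
      config,
      [](const std::string& t) { std::cout << t << std::flush; },
      [&tokens_in_cache](const Stats& s) {
        tokens_in_cache = s.num_prompt_tokens + s.num_generated_tokens;
      });

  // The follow-up prompt reuses the cached context instead of re-prefilling it.
  runner.generate_from_pos(
      "Now summarize it in one sentence.",
      tokens_in_cache,
      config,
      [](const std::string& t) { std::cout << t << std::flush; },
      nullptr);
}
```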
`void stop()`
Immediately stops the generation loop. This is typically called from another thread to interrupt a long-running generation.
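For instance, a watchdog thread could cancel a generation that runs too long. This is a simplified sketch (`generate_with_timeout` is hypothetical, and a real application would also handle the case where generation finishes before the timeout):

```cpp
// Sketch: interrupt a long-running generation after a timeout.
#include <executorch/extension/llm/runner/text_llm_runner.h>

#include <chrono>
#include <iostream>
#include <string>
#include <thread>

using namespace executorch::extension::llm;

void generate_with_timeout(IRunner& runner, const std::string& prompt) {
  // Watchdog thread: request cancellation after 5 seconds.
  std::thread watchdog([&runner]() {
    std::this_thread::sleep_for(std::chrono::seconds(5));
    runner.stop();  // Breaks the generation loop from another thread.
  });

  GenerationConfig config;
  runner.generate(
      prompt,
      config,
      [](const std::string& token) { std::cout << token << std::flush; },
      nullptr);

  watchdog.join();  // generate() returns once stop() takes effect or EOS is hit.
}
```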
GenerationConfig Structure#
The `GenerationConfig` struct controls various aspects of the generation process:
```cpp
struct GenerationConfig {
  bool echo = true;             // Whether to echo the input prompt in the output
  int32_t max_new_tokens = -1;  // Maximum number of new tokens to generate
  bool warming = false;         // Whether this is a warmup run
  int32_t seq_len = -1;         // Maximum number of total tokens
  float temperature = 0.8f;     // Temperature for sampling
  int32_t num_bos = 0;          // Number of BOS tokens to add
  int32_t num_eos = 0;          // Number of EOS tokens to add

  // Helper method to resolve the actual max_new_tokens based on constraints
  int32_t resolve_max_new_tokens(
      int32_t max_context_len,
      int32_t num_prompt_tokens) const;
};
```
The `resolve_max_new_tokens` method handles the logic of determining how many tokens can be generated based on the following constraints (see the sketch after this list):
- The model’s maximum context length
- The number of tokens in the prompt
- The user-specified maximum sequence length and maximum new tokens
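Conceptually, the resolution takes the tightest of these limits. The sketch below illustrates the idea; `resolve_max_new_tokens_sketch` is a conceptual stand-in, not the library's actual implementation:

```cpp
// Conceptual sketch of the constraint resolution (not the library's exact code):
// the number of new tokens is capped by whatever context budget remains.
#include <algorithm>
#include <cstdint>

int32_t resolve_max_new_tokens_sketch(
    int32_t max_context_len,
    int32_t num_prompt_tokens,
    int32_t seq_len,          // user-requested total length, -1 if unset
    int32_t max_new_tokens) { // user-requested new tokens, -1 if unset
  // Start from the hard limit imposed by the model's context window.
  int32_t budget = max_context_len - num_prompt_tokens;

  // A user-specified total sequence length tightens the budget further.
  if (seq_len != -1) {
    budget = std::min(budget, seq_len - num_prompt_tokens);
  }

  // An explicit max_new_tokens caps the result directly.
  if (max_new_tokens != -1) {
    budget = std::min(budget, max_new_tokens);
  }
  return std::max(budget, 0);
}
```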
Implementation Components#
The runner library consists of several specialized components that work together:
TextLLMRunner#
The main implementation of the `IRunner` interface that orchestrates the text generation process. It manages:

- Tokenization of input text
- Prefilling the KV cache with prompt tokens
- Generating new tokens one by one
- Collecting performance statistics
TextPrefiller#
Responsible for processing the initial prompt tokens and filling the KV cache. Key features:
- Efficiently processes large prompts
- Handles dynamic sequence lengths
- Supports parallel prefilling for performance optimization
TextTokenGenerator#
Generates new tokens one by one in an autoregressive manner (a conceptual sketch of this loop follows the list). It:

- Manages the token generation loop
- Applies temperature-based sampling
- Detects end-of-sequence conditions
- Streams tokens as they’re generated
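The loop these components implement can be pictured roughly as follows. This is a conceptual sketch: `generate_loop` and the three callables standing in for the decoder step, tokenizer decoding, and EOS detection are placeholders, not library APIs:

```cpp
// Conceptual sketch of the autoregressive generation loop.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

void generate_loop(
    uint64_t cur_token,
    int64_t pos,
    int32_t max_new_tokens,
    const std::function<uint64_t(uint64_t, int64_t)>& run_decoder_step,  // forward pass + sampling (placeholder)
    const std::function<std::string(uint64_t)>& decode_token,            // tokenizer decode (placeholder)
    const std::function<bool(uint64_t)>& is_eos) {                       // EOS check (placeholder)
  for (int32_t i = 0; i < max_new_tokens; ++i) {
    // One decoder step: feed the current token, update the KV cache,
    // and sample the next token from the logits.
    uint64_t next = run_decoder_step(cur_token, pos++);

    // Stream the decoded text piece as soon as it is available.
    std::cout << decode_token(next) << std::flush;

    if (is_eos(next)) {
      break;  // Stop as soon as an end-of-sequence token is produced.
    }
    cur_token = next;
  }
}
```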
TextDecoderRunner#
Interfaces with the ExecuTorch Module to run the model forward pass. It:
- Manages inputs and outputs to the model
- Handles KV cache updates
- Converts logits to tokens via sampling
Tokenizer Support#
The runner library supports multiple tokenizer formats through a unified interface:
```cpp
std::unique_ptr<tokenizers::Tokenizer> tokenizer = load_tokenizer(
    tokenizer_path,  // Path to tokenizer file
    nullptr,         // Optional special tokens
    std::nullopt,    // Optional regex pattern (for TikToken)
    0,               // BOS token index
    0);              // EOS token index
```
Supported tokenizer formats include:
- HuggingFace Tokenizers: JSON format tokenizers
- SentencePiece: `.model` format tokenizers
- TikToken: BPE tokenizers
- Llama2c: BPE tokenizers in the Llama2.c format
For custom tokenizers, you can find implementations in the `meta-pytorch/tokenizers` repository.
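Regardless of format, the loaded tokenizer is used through the same interface. A small sanity check after loading might look like the sketch below; the null check and the accessor names `vocab_size()`, `bos_tok()`, and `eos_tok()` are assumptions about the tokenizers library and may differ in your version:

```cpp
// Sketch: verify the tokenizer loaded and inspect its special tokens.
#include <executorch/extension/llm/runner/text_llm_runner.h>

#include <iostream>

using namespace executorch::extension::llm;

int main() {
  auto tokenizer = load_tokenizer("path/to/tokenizer.json", nullptr, std::nullopt, 0, 0);
  if (tokenizer == nullptr) {
    std::cerr << "Failed to load tokenizer" << std::endl;  // assumed failure behavior
    return 1;
  }

  // Accessor names below are assumed from the meta-pytorch/tokenizers interface.
  std::cout << "Vocab size: " << tokenizer->vocab_size() << "\n"
            << "BOS token: " << tokenizer->bos_tok() << "\n"
            << "EOS token: " << tokenizer->eos_tok() << std::endl;
  return 0;
}
```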
Other APIs#
Model Warmup#
For more accurate timing and optimal performance, you should perform a warmup run before actual inference:
```cpp
runner->warmup("Hello world", 10);  // Generate 10 tokens as warmup
```
During warmup:
- A special `GenerationConfig` is created with:
  - `echo = false`: The prompt is not included in the output
  - `warming = true`: Indicates this is a warmup run
  - `max_new_tokens`: Set to the specified number of tokens to generate
- The model runs through the entire generation pipeline:
  - Loading the model (if not already loaded)
  - Tokenizing the prompt
  - Prefilling the KV cache
  - Generating the specified number of tokens
Special behavior during warmup:
- Tokens are not displayed to the console
- The runner logs "Doing a warmup run…" and "Warmup run finished!" messages
After warmup:
- The `Stats` object is reset to clear performance metrics
- The model remains loaded and ready for actual inference
Warmup is particularly important for accurate benchmarking as the first inference often includes one-time initialization costs that would skew performance measurements.
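For example, a simple benchmark might warm up once and then time the real generation with `std::chrono`. This is a sketch: `benchmark` is a hypothetical helper, and the prompt and token counts are arbitrary:

```cpp
// Sketch: warm up once, then time the actual generation.
#include <executorch/extension/llm/runner/text_llm_runner.h>

#include <chrono>
#include <iostream>
#include <string>

using namespace executorch::extension::llm;

void benchmark(TextLLMRunner& runner, const std::string& prompt) {
  // Absorb one-time initialization costs before measuring.
  runner.warmup(prompt, /*max_new_tokens=*/10);

  GenerationConfig config;
  config.max_new_tokens = 128;

  auto start = std::chrono::steady_clock::now();
  runner.generate(prompt, config, [](const std::string&) {}, nullptr);
  auto end = std::chrono::steady_clock::now();

  std::cout << "Generation took "
            << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
            << " ms" << std::endl;
}
```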
Memory Usage Monitoring#
You can monitor memory usage (resident set size) with the `get_rss_bytes()` utility:
```cpp
std::cout << "RSS after loading: " << get_rss_bytes() / 1024.0 / 1024.0 << " MiB" << std::endl;
```
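A fuller sketch might sample RSS at a few key points. Here `report_memory` is a hypothetical helper, and the header location of `get_rss_bytes()` is an assumption:

```cpp
// Sketch: report resident set size before load, after load, and after generation.
#include <executorch/extension/llm/runner/text_llm_runner.h>
#include <executorch/extension/llm/runner/util.h>  // assumed location of get_rss_bytes()

#include <iostream>
#include <string>

using namespace executorch::extension::llm;

void report_memory(TextLLMRunner& runner) {
  auto rss_mib = []() { return get_rss_bytes() / 1024.0 / 1024.0; };

  std::cout << "RSS before load: " << rss_mib() << " MiB" << std::endl;
  runner.load();
  std::cout << "RSS after load: " << rss_mib() << " MiB" << std::endl;

  GenerationConfig config;
  config.max_new_tokens = 32;
  runner.generate("Hello", config, [](const std::string&) {}, nullptr);
  std::cout << "RSS after generation: " << rss_mib() << " MiB" << std::endl;
}
```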
