Manual model conversion on GPU
This article introduces the manual workflow for converting LLM models using a local Nvidia GPU. It describes the required environment setup, execution steps, and how to run inference on a Windows Copilot+ PC with a Qualcomm NPU.
Conversion of LLM models requires an Nvidia GPU. If you want Model Lab to manage your local GPU, follow the steps in Convert Model. Otherwise, follow the steps in this article.
Manually run model conversion on GPU
This workflow is configured using the `qnn_config.json` file and requires two separate Python environments.
- The first environment is used for model conversion with GPU acceleration and includes packages like onnxruntime-gpu and AutoGPTQ.
- The second environment is used for QNN optimization and includes packages like onnxruntime-qnn with specific dependencies.
First environment setup
In a Python 3.10 x64 environment with Olive installed, install the required packages:
```bash
# Install common dependencies
pip install -r requirements.txt

# Install ONNX Runtime GPU packages
pip install "onnxruntime-gpu>=1.21.0" "onnxruntime-genai-cuda>=0.6.0"

# AutoGPTQ: install from source (the stable package may be slow for weight packing)
# Disable the CUDA extension build (not required)
# Linux
export BUILD_CUDA_EXT=0
# Windows
# set BUILD_CUDA_EXT=0

# Install AutoGPTQ from source
pip install --no-build-isolation git+https://github.com/PanQiWei/AutoGPTQ.git

# Update the CUDA version if needed
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
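Optionally, before moving on, you can verify that both PyTorch and ONNX Runtime can see the GPU. This is a minimal sanity check, not part of the documented workflow:

```python
# Optional sanity check: confirm that CUDA is visible to PyTorch and that
# onnxruntime-gpu registers the CUDA execution provider.
import torch
import onnxruntime as ort

print("torch CUDA available:", torch.cuda.is_available())
print("ORT providers:", ort.get_available_providers())  # expect 'CUDAExecutionProvider'
```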
⚠️ Only set up the environment and install the packages. Do not run the `olive run` command at this point.
Second environment setup
In a Python 3.10 x64 environment with Olive installed, install the required packages:
```bash
# Install ONNX Runtime QNN
pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt
pip install -U --pre --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
```
Replace `/path/to/qnn/env/bin` in `qnn_config.json` with the path to the directory containing the second environment's Python executable.
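If you are unsure which directory to use, you can print it from within the second environment; this is just a convenience snippet, not part of the original instructions:

```python
# Run with the second (QNN) environment's Python interpreter.
# Prints the directory containing its executable, which is the value
# to place in qnn_config.json.
import os
import sys

print(os.path.dirname(sys.executable))
```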
Run the config
Activate the first environment and run the workflow:
```bash
olive run --config qnn_config.json
```
After the command completes, the optimized model is saved in `./model/model_name`.
⚠️ If optimization fails with an out-of-memory error, remove `calibration_providers` from the config file.
⚠️ If optimization fails during context binary generation, rerun the command. The process will resume from the last completed step.
Manually run inference samples
The optimized model can be used for inference with the ONNX Runtime QNN Execution Provider and ONNX Runtime GenAI. Inference must be run on a Windows Copilot+ PC with a Qualcomm NPU.
Install required packages in an arm64 Python environment
Model compilation using the QNN Execution Provider requires a Python environment with `onnxruntime-qnn` installed. In a separate Python environment with Olive installed, install the required packages:
```bash
pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt
pip install -U --pre --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
pip install "onnxruntime-genai>=0.7.0rc2"
```
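As an optional check (not part of the documented steps), you can confirm that the QNN execution provider is registered in this environment:

```python
# Optional: verify that the QNN execution provider is available.
import onnxruntime as ort

print(ort.get_available_providers())  # expect 'QNNExecutionProvider' in the list
```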
Run inference sample
Execute the provided `inference_sample.ipynb` notebook. Select this arm64 Python environment as the notebook's kernel (ipykernel).
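For orientation, the sketch below shows roughly what generation with ONNX Runtime GenAI looks like; the prompt and search options are placeholders, and the provided notebook may differ in its details:

```python
# Minimal sketch of text generation with ONNX Runtime GenAI
# (onnxruntime-genai >= 0.7 assumed); prompt and max_length are placeholders.
import onnxruntime_genai as og

model = og.Model("./model/model_name")   # folder containing genai_config.json
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is the capital of France?"))

while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```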
⚠️ If you get a `6033` error, replace `genai_config.json` in the `./model/model_name` folder.