Manual model conversion on GPU
This article introduces the manual workflow for converting LLM models using a local Nvidia GPU. It describes the required environment setup, execution steps, and how to run inference on a Windows Copilot+ PC with a Qualcomm NPU.
Conversion of LLM models requires an Nvidia GPU. If you want Model Lab to manage your local GPU, follow the steps in Convert Model. Otherwise, follow the steps in this article.
Manually run model conversion on GPU
This workflow is configured using the `qnn_config.json` file and requires two separate Python environments.
- The first environment is used for model conversion with GPU acceleration and includes packages like onnxruntime-gpu and AutoGPTQ.
- The second environment is used for QNN optimization and includes packages like onnxruntime-qnn with specific dependencies.
First environment setup
In a Python 3.10 x64 environment with Olive installed, install the required packages:
```bash
# Install common dependencies
pip install -r requirements.txt

# Install ONNX Runtime GPU packages
pip install "onnxruntime-gpu>=1.21.0" "onnxruntime-genai-cuda>=0.6.0"

# AutoGPTQ: install from source (the stable package may be slow for weight packing)
# Disable the CUDA extension build (not required)
# Linux
export BUILD_CUDA_EXT=0
# Windows
# set BUILD_CUDA_EXT=0

# Install AutoGPTQ from source
pip install --no-build-isolation git+https://github.com/PanQiWei/AutoGPTQ.git

# Update the CUDA version if needed
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
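Optionally, before moving on, you can verify that both PyTorch and ONNX Runtime can see the GPU. This is a minimal sanity check, not part of the documented workflow:

```python
# Optional sanity check: confirm that CUDA is visible to PyTorch and that
# onnxruntime-gpu registers the CUDA execution provider.
import torch
import onnxruntime as ort

print("torch CUDA available:", torch.cuda.is_available())
print("ORT providers:", ort.get_available_providers())  # expect 'CUDAExecutionProvider'
```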
⚠️ Only set up the environment and install the packages. Do not run the `olive run` command at this point.
Second environment setup
In a Python 3.10 x64 environment with Olive installed, install the required packages:
```bash
# Install ONNX Runtime QNN
pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt
pip install -U --pre --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
```
Replace `/path/to/qnn/env/bin` in `qnn_config.json` with the path to the directory containing the second environment's Python executable.
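If you are unsure which directory to use, you can print it from within the second environment; this is just a convenience snippet, not part of the original instructions:

```python
# Run with the second (QNN) environment's Python interpreter.
# Prints the directory containing its executable, which is the value
# to place in qnn_config.json.
import os
import sys

print(os.path.dirname(sys.executable))
```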
Run the config
Activate the first environment and run the workflow:
```bash
olive run --config qnn_config.json
```
After the command completes, the optimized model is saved in `./model/model_name`.
⚠️ If optimization fails with an out-of-memory error, remove `calibration_providers` from the config file.
⚠️ If optimization fails during context binary generation, rerun the command. The process will resume from the last completed step.
Manually run inference samples
The optimized model can be used for inference with the ONNX Runtime QNN Execution Provider and ONNX Runtime GenAI. Inference must be run on a Windows Copilot+ PC with a Qualcomm NPU.
Install required packages in an arm64 Python environment
Model compilation using the QNN Execution Provider requires a Python environment with `onnxruntime-qnn` installed. In a separate Python environment with Olive installed, install the required packages:
```bash
pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt
pip install -U --pre --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
pip install "onnxruntime-genai>=0.7.0rc2"
```
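As an optional check (not part of the documented steps), you can confirm that the QNN execution provider is registered in this environment:

```python
# Optional: verify that the QNN execution provider is available.
import onnxruntime as ort

print(ort.get_available_providers())  # expect 'QNNExecutionProvider' in the list
```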
Run inference sample
Execute the provided `inference_sample.ipynb` notebook. Select this arm64 Python environment as the notebook's kernel (ipykernel).
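For orientation, the sketch below shows roughly what generation with ONNX Runtime GenAI looks like; the prompt and search options are placeholders, and the provided notebook may differ in its details:

```python
# Minimal sketch of text generation with ONNX Runtime GenAI
# (onnxruntime-genai >= 0.7 assumed); prompt and max_length are placeholders.
import onnxruntime_genai as og

model = og.Model("./model/model_name")   # folder containing genai_config.json
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is the capital of France?"))

while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```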
⚠️ If you get a `6033` error, replace `genai_config.json` in the `./model/model_name` folder.