Optimum Inference with ONNX Runtime
Optimum is a utility package for building and running inference with accelerated runtimes like ONNX Runtime. Optimum can be used to load optimized models from the Hugging Face Hub and create pipelines to run accelerated inference without rewriting your APIs.
Loading
Transformers models
Once your model has been exported to the ONNX format, you can load it by replacing `AutoModelForXxx` with the corresponding `ORTModelForXxx` class.
```diff
  from transformers import AutoTokenizer, pipeline
- from transformers import AutoModelForCausalLM
+ from optimum.onnxruntime import ORTModelForCausalLM

- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # PyTorch checkpoint
+ model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx")  # ONNX checkpoint
  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

  pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
  result = pipe("He never went out without a book under his arm")
```
More information on all the supported `ORTModelForXxx` classes can be found in our documentation.
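Because `ORTModelForCausalLM` implements the same generation API as its `transformers` counterpart, you can also skip the pipeline and call `generate()` directly. A minimal sketch reusing the checkpoints from the example above (the generation parameters are illustrative, not prescribed):

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# Same ONNX checkpoint as above; the tokenizer still comes from the original repo
model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

inputs = tokenizer("He never went out without a book under his arm", return_tensors="pt")
# generate() works the same way as on a regular transformers model
generated = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```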
Diffusers models
Once your model has been exported to the ONNX format, you can load it by replacing `DiffusionPipeline` with the corresponding `ORTDiffusionPipeline` class.
```diff
- from diffusers import DiffusionPipeline
+ from optimum.onnxruntime import ORTDiffusionPipeline

  model_id = "runwayml/stable-diffusion-v1-5"
- pipeline = DiffusionPipeline.from_pretrained(model_id)
+ pipeline = ORTDiffusionPipeline.from_pretrained(model_id, revision="onnx")
  prompt = "sailing ship in storm by Leonardo da Vinci"
  image = pipeline(prompt).images[0]
```
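The returned `image` is a standard PIL image, so saving it works as usual. If you have `onnxruntime-gpu` installed, you can also request the CUDA execution provider at load time; a minimal sketch under that assumption, reusing the same checkpoint:

```python
from optimum.onnxruntime import ORTDiffusionPipeline

# Assumes onnxruntime-gpu is installed; loading fails otherwise
pipeline = ORTDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="onnx",
    provider="CUDAExecutionProvider",
)
image = pipeline("sailing ship in storm by Leonardo da Vinci").images[0]
image.save("ship.png")  # the output is a regular PIL.Image
```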
Sentence Transformers models
Once your model has been exported to the ONNX format, you can load it by replacing `AutoModel` with the corresponding `ORTModelForFeatureExtraction` class.
```diff
  from transformers import AutoTokenizer
- from transformers import AutoModel
+ from optimum.onnxruntime import ORTModelForFeatureExtraction

  tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
- model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
+ model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")

  inputs = tokenizer("This is an example sentence", return_tensors="pt")
  outputs = model(**inputs)
```
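Note that `ORTModelForFeatureExtraction` returns token-level hidden states, so producing a single sentence embedding still requires pooling, exactly as with the plain `transformers` model. A minimal sketch using mean pooling over the attention mask (the pooling strategy is an assumption; check the model card for what the checkpoint was trained with):

```python
import torch
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")

inputs = tokenizer("This is an example sentence", return_tensors="pt")
outputs = model(**inputs)

# Mean-pool the token embeddings, ignoring padding via the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
```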
You can also load your ONNX model directly using the `sentence_transformers.SentenceTransformer` class; just make sure you have `sentence-transformers>=3.2` installed. If the model hasn't already been converted to ONNX, it will be converted automatically on-the-fly.
```diff
  from sentence_transformers import SentenceTransformer

  model_id = "sentence-transformers/all-MiniLM-L6-v2"
- model = SentenceTransformer(model_id)
+ model = SentenceTransformer(model_id, backend="onnx")

  sentences = ["This is an example sentence", "Each sentence is converted"]
  embeddings = model.encode(sentences)
```
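The embeddings behave exactly as with the default PyTorch backend; for instance, the `similarity()` helper available in `sentence-transformers>=3.0` works unchanged:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", backend="onnx")
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)

# Cosine similarity between every pair of sentences
similarities = model.similarity(embeddings, embeddings)
print(similarities)  # 2x2 matrix with ones on the diagonal
```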
Timm models
Once your model has been exported to the ONNX format, you can load it by replacing the `create_model` function with the corresponding `ORTModelForImageClassification` class.
```diff
  import requests
  from PIL import Image
- from timm import create_model
  from timm.data import resolve_data_config, create_transform
+ from optimum.onnxruntime import ORTModelForImageClassification

- model = create_model("timm/mobilenetv3_large_100.ra_in1k", pretrained=True)
+ model = ORTModelForImageClassification.from_pretrained("optimum/mobilenetv3_large_100.ra_in1k")
  transform = create_transform(**resolve_data_config(model.config.pretrained_cfg, model=model))

  url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png"
  image = Image.open(requests.get(url, stream=True).raw)
  inputs = transform(image).unsqueeze(0)
  outputs = model(inputs)
```
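The ORT model returns logits just like the original timm model, so post-processing is unchanged. Continuing from the snippet above, a small sketch that turns `outputs.logits` into top-5 class probabilities (mapping indices to label names is omitted, since it depends on the checkpoint's config):

```python
import torch

# Continuing from above: outputs.logits has shape (1, num_classes)
probabilities = torch.softmax(outputs.logits, dim=-1)
top5_prob, top5_idx = torch.topk(probabilities, k=5)
print(top5_idx[0].tolist(), top5_prob[0].tolist())
```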
Converting your model to ONNX on-the-fly
In case your model wasn't already converted to ONNX, `ORTModel` includes a method to convert it on-the-fly. Simply pass `export=True` to the `from_pretrained()` method, and your model will be loaded and converted to ONNX automatically:
```python
>>> from optimum.onnxruntime import ORTModelForSequenceClassification

>>> # Load the model from the hub and export it to the ONNX format
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
```
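The exported model is a drop-in replacement, so it plugs straight into a `transformers` pipeline, mirroring the first example in this guide:

```python
>>> from transformers import AutoTokenizer, pipeline

>>> # Reuse the exported model from above in a regular transformers pipeline
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> result = pipe("He never went out without a book under his arm")
```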
Pushing your model to the Hub
You can also call `push_to_hub` directly on your model to upload it to the Hub.
```python
>>> from optimum.onnxruntime import ORTModelForSequenceClassification

>>> # Load the model from the hub and export it to the ONNX format
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

>>> # Save the converted model locally
>>> output_dir = "a_local_path_for_convert_onnx_model"
>>> model.save_pretrained(output_dir)

>>> # Push the onnx model to HF Hub
>>> model.push_to_hub(output_dir, repository_id="my-onnx-repo")
```
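Once pushed, the ONNX weights can be loaded back from the Hub like any other checkpoint. A minimal sketch; the repository id below assumes the illustrative `my-onnx-repo` name used above, prefixed with your username:

```python
>>> from optimum.onnxruntime import ORTModelForSequenceClassification

>>> # "your-username/my-onnx-repo" is the hypothetical repo pushed above
>>> model = ORTModelForSequenceClassification.from_pretrained("your-username/my-onnx-repo")
```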