
Olive

PyPI release | Documentation

AI Model Optimization Toolkit for the ONNX Runtime

Given a model and targeted hardware, Olive (short for ONNX LIVE) composes the most suitable optimization techniques to output the most efficient ONNX model(s) for inferencing on the cloud or edge, while taking a set of constraints such as accuracy and latency into consideration.

✅ Benefits of using Olive

  • Reduce the frustration of manual trial-and-error model optimization experimentation. Define your target and precision and let Olive automatically produce the best model for you.
  • 40+ built-in model optimization components covering industry-leading techniques across model compression, optimization, finetuning, and compilation.
  • Easy-to-use CLI for common model optimization tasks.
  • Workflows to orchestrate model transformation and optimization steps (see the sketch after this list).
  • Support for compiling LoRA adapters for MultiLoRA serving.
  • Seamless integration with Hugging Face and Azure AI.
  • Built-in caching mechanism to improve productivity.
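
Workflows are expressed as a config and can be run from the CLI or from Python. The snippet below is only a minimal sketch of the idea, assuming the olive.workflows.run entry point used in Olive's examples; the pass names and config fields shown (HfModel, OnnxConversion, OnnxQuantization, output_dir) are illustrative assumptions and may differ between Olive versions, so consult the Olive documentation for the exact schema.

```python
# workflow_sketch.py - an illustrative (not authoritative) Olive workflow.
# Field and pass names are assumptions based on Olive's published examples
# and may vary between Olive versions.
from olive.workflows import run as olive_run

config = {
    # Model to start from (a Hugging Face model in this sketch).
    "input_model": {
        "type": "HfModel",
        "model_path": "HuggingFaceTB/SmolLM2-135M-Instruct",
    },
    # Each pass is one transformation step; Olive orchestrates them in order.
    "passes": {
        "conversion": {"type": "OnnxConversion"},
        "quantization": {"type": "OnnxQuantization"},
    },
    "output_dir": "models/smolm2-workflow",
}

if __name__ == "__main__":
    olive_run(config)
```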

📰 News Highlights

Here are some recent videos, blog articles and labs that highlight Olive:

For a full list of news and blogs, read the news archive.

🚀 Getting Started

Notebooks available!

The following notebooks demonstrate key optimization workflows with Olive and include the application code to run inference on the optimized models with the ONNX Runtime.

| Title | Task | Description | Time Required | Notebook Links |
| --- | --- | --- | --- | --- |
| Quickstart | Text Generation | Learn how to quantize & optimize an SLM for the ONNX Runtime using a single Olive command. | 5 mins | Download / Open in Colab |
| Optimizing popular SLMs | Text Generation | Choose from a curated list of over 20 popular SLMs to quantize & optimize for the ONNX Runtime. | 5 mins | Download / Open in Colab |
| How to finetune models for on-device inference | Text Generation | Learn how to quantize (using the AWQ method), fine-tune, and optimize an SLM for on-device inference. | 15 mins | Download / Open in Colab |
| Finetune and Optimize DeepSeek R1 with Olive | Text Generation | Learn how to finetune and optimize DeepSeek-R1-Distill-Qwen-1.5B for on-device inference. | 15 mins | Download / Open in Colab |

✨ Quickstart

If you prefer using the command line directly instead of Jupyter notebooks, we've outlined the quickstart commands here.

1. Install Olive CLI

We recommend installing Olive in a virtual environment or a conda environment.

```bash
pip install olive-ai[auto-opt]
pip install transformers onnxruntime-genai
```

Note

Olive has optional dependencies that can be installed to enable additional features. Please refer to the Olive package config for the list of extras and their dependencies.

2. Automatic Optimizer

In this quickstart you'll be optimizing HuggingFaceTB/SmolLM2-135M-Instruct, which has many model files in the Hugging Face repo for different precisions that are not required by Olive. To minimize the download, cache the original Hugging Face model files (safetensors and configuration) in the main folder of the Hugging Face repo using:

```bash
huggingface-cli download HuggingFaceTB/SmolLM2-135M-Instruct *.json *.safetensors *.txt
```

Next, run the automatic optimization:

```bash
olive auto-opt \
    --model_name_or_path HuggingFaceTB/SmolLM2-135M-Instruct \
    --output_path models/smolm2 \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_ort_genai \
    --precision int4 \
    --log_level 1
```

Tip

PowerShell users: line continuation syntax differs between Bash and PowerShell and is not interchangeable. If you are using PowerShell, copy and paste the following command, which uses compatible line continuation.

```powershell
olive auto-opt `
    --model_name_or_path HuggingFaceTB/SmolLM2-135M-Instruct `
    --output_path models/smolm2 `
    --device cpu `
    --provider CPUExecutionProvider `
    --use_ort_genai `
    --precision int4 `
    --log_level 1
```

The automatic optimizer will:

  1. Acquire the model from the local cache (note: if you skipped the model download step then the entire contents of the Hugging Face model repo will be downloaded).
  2. Capture the ONNX Graph and store the weights in an ONNX data file.
  3. Optimize the ONNX Graph.
  4. Quantize the model to int4 using the RTN method.

Olive can automatically optimize popular model architectures like Llama, Phi, Qwen, Gemma, etc. out of the box (see the detailed list here). You can also optimize other model architectures by providing details on the model's inputs and outputs (io_config).
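
If you are unsure which input and output names a model exposes (for example, when filling in an io_config, or when checking what Olive produced), the onnxruntime Python package can report the signature of an existing ONNX file. This is a minimal sketch; the model path below is an assumption based on the quickstart output above, and it assumes onnxruntime is installed in your environment.

```python
# inspect_io.py - print the input/output signature of an ONNX model.
# The path assumes the quickstart output layout (models/smolm2/model/model.onnx).
import onnxruntime as ort

session = ort.InferenceSession("models/smolm2/model/model.onnx")

print("Inputs:")
for i in session.get_inputs():
    print(f"  {i.name}: shape={i.shape}, type={i.type}")

print("Outputs:")
for o in session.get_outputs():
    print(f"  {o.name}: shape={o.shape}, type={o.type}")
```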

3. Inference on the ONNX Runtime

The ONNX Runtime (ORT) is a fast and lightweight cross-platform inference engine with bindings for popular programming languages such as Python, C/C++, C#, Java, and JavaScript. ORT enables you to infuse AI models into your applications so that inference is handled on-device.

The following code creates a simple console-based chat interface that runs inference on your optimized model. Python and C# versions are shown below.

Python

Create a Python file called app.py and copy and paste the following code:

```python
# app.py
import onnxruntime_genai as og

model_folder = "models/smolm2/model"

# Load the base model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 200

chat_template = "<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n"

# Keep asking for input prompts in a loop
while True:
    text = input("Prompt (Use quit() to exit): ")
    if not text:
        print("Error, input cannot be empty")
        continue
    if text == "quit()":
        break

    # Generate prompt (prompt template + input)
    prompt = f'{chat_template.format(input=text)}'

    # Encode the prompt using the tokenizer
    input_tokens = tokenizer.encode(prompt)

    # Create params and generator
    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    generator = og.Generator(model, params)

    # Append input tokens to the generator
    generator.append_tokens(input_tokens)

    print("")
    print("Output: ", end='', flush=True)

    # Stream the output
    try:
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
    except KeyboardInterrupt:
        print("  --control+c pressed, aborting generation--")

    print()
    print()
    del generator
```

To run the code, execute python app.py. You'll be prompted to enter a message to the SLM - for example, you could ask what is the golden ratio, or def print_hello_world():. To exit, type quit() in the chat interface.

C#

Create a new C# Console app and install the Microsoft.ML.OnnxRuntimeGenAI NuGet package into your project:

```bash
mkdir ortapp
cd ortapp
dotnet new console
dotnet add package Microsoft.ML.OnnxRuntimeGenAI --version 0.5.2
```

Next, copy and paste the following code into your Program.cs file and update the modelPath variable to be the absolute path of where you stored your optimized model.

```csharp
// Program.cs
using Microsoft.ML.OnnxRuntimeGenAI;

internal class Program
{
    private static void Main(string[] args)
    {
        string modelPath = @"models/smolm2/model";

        Console.Write("Loading model from " + modelPath + "...");
        using Model model = new(modelPath);
        Console.Write("Done\n");

        using Tokenizer tokenizer = new(model);
        using TokenizerStream tokenizerStream = tokenizer.CreateStream();

        while (true)
        {
            Console.Write("User:");
            string prompt = "<|im_start|>user\n"
                + Console.ReadLine()
                + "<|im_end|>\n<|im_start|>assistant\n";

            var sequences = tokenizer.Encode(prompt);

            using GeneratorParams gParams = new GeneratorParams(model);
            gParams.SetSearchOption("max_length", 200);
            using Generator generator = new(model, gParams);
            generator.AppendTokenSequences(sequences);

            Console.Out.Write("\nAI:");
            while (!generator.IsDone())
            {
                generator.GenerateNextToken();
                var token = generator.GetSequence(0)[^1];
                Console.Out.Write(tokenizerStream.Decode(token));
                Console.Out.Flush();
            }
            Console.WriteLine();
        }
    }
}
```

Run the application:

```bash
dotnet run
```

You'll be prompted to enter a message to the SLM - for example, you could ask what is the golden ratio, or def print_hello_world():. To exit, type exit in the chat interface.

🎓 Learn more

🤝 Contributions and Feedback

⚖️ License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

