pytorch/executorch
On-device AI across mobile, embedded and edge for PyTorch
ExecuTorch is PyTorch's unified solution for deploying AI models on-device—from smartphones to microcontrollers—built for privacy, performance, and portability. It powers Meta's on-device AI across Instagram, WhatsApp, Quest 3, Ray-Ban Meta Smart Glasses, and more.
Deploy LLMs, vision, speech, and multimodal models with the same PyTorch APIs you already know—accelerating research to production with seamless model export, optimization, and deployment. No manual C++ rewrites. No format conversions. No vendor lock-in.
- 🔒 Native PyTorch Export — Direct export from PyTorch. No .onnx, .tflite, or intermediate format conversions. Preserve model semantics.
- ⚡ Production-Proven — Powers billions of users at Meta with real-time on-device inference.
- 💾 Tiny Runtime — 50KB base footprint. Runs on microcontrollers to high-end smartphones.
- 🚀 12+ Hardware Backends — Open-source acceleration for Apple, Qualcomm, ARM, MediaTek, Vulkan, and more.
- 🎯 One Export, Multiple Backends — Switch hardware targets with a single line change. Deploy the same model everywhere.
ExecuTorch uses ahead-of-time (AOT) compilation to prepare PyTorch models for edge deployment:
- 🧩 Export — Capture your PyTorch model graph with torch.export()
- ⚙️ Compile — Quantize, optimize, and partition to hardware backends → .pte
- 🚀 Execute — Load .pte on-device via a lightweight C++ runtime
Models use a standardized Core ATen operator set. Partitioners delegate subgraphs to specialized hardware (NPU/GPU) with CPU fallback.
Learn more: How ExecuTorch Works • Architecture Guide
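As a rough illustration of that fallback behavior, to_edge_transform_and_lower accepts an ordered list of partitioners: subgraphs the first backend cannot take are offered to the next, and anything left runs on the portable CPU kernels. This is a minimal sketch only; it assumes an `exported_program` produced by torch.export.export() (as in the Quick Start below), and the CoreMLPartitioner import path shown here is an assumption that may differ between releases.

```python
# Sketch: partitioner order expresses delegation preference.
# CoreMLPartitioner's import path is an assumption; XnnpackPartitioner is as in the Quick Start.
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.apple.coreml.partition import CoreMLPartitioner
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

program = to_edge_transform_and_lower(
    exported_program,  # from torch.export.export(model, example_inputs)
    partitioner=[CoreMLPartitioner(), XnnpackPartitioner()],  # NPU/GPU first, CPU fallback
).to_executorch()
```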
```bash
pip install executorch
```
For platform-specific setup (Android, iOS, embedded systems), see the Quick Start documentation.
```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# 1. Export your PyTorch model
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
exported_program = torch.export.export(model, example_inputs)

# 2. Optimize for target hardware (switch backends with one line)
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]  # CPU | CoreMLPartitioner() for iOS | QnnPartitioner() for Qualcomm
).to_executorch()

# 3. Save for deployment
with open("model.pte", "wb") as f:
    f.write(program.buffer)

# Test locally via ExecuTorch runtime's pybind API (optional)
from executorch.runtime import Runtime

runtime = Runtime.get()
method = runtime.load_program("model.pte").load_method("forward")
outputs = method.execute([torch.randn(1, 3, 224, 224)])
```
```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

Module module("model.pte");
auto tensor = make_tensor_ptr({2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
auto outputs = module.forward(tensor);
```
```swift
import ExecuTorch

let module = Module(filePath: "model.pte")
let input = Tensor<Float>([1.0, 2.0, 3.0, 4.0], shape: [2, 2])
let outputs = try module.forward(input)
```
```kotlin
val module = Module.load("model.pte")
val inputTensor = Tensor.fromBlob(floatArrayOf(1.0f, 2.0f, 3.0f, 4.0f), longArrayOf(2, 2))
val outputs = module.forward(EValue.from(inputTensor))
```
Export Llama models using the export_llm script or Optimum-ExecuTorch:
```bash
# Using export_llm
python -m executorch.extension.llm.export.export_llm --model llama3_2 --output llama.pte

# Using Optimum-ExecuTorch
optimum-cli export executorch \
  --model meta-llama/Llama-3.2-1B \
  --task text-generation \
  --recipe xnnpack \
  --output_dir llama_model
```
Run on-device with the LLM runner API:
```cpp
#include <executorch/extension/llm/runner/text_llm_runner.h>

auto runner = create_llama_runner("llama.pte", "tiktoken.bin");
executorch::extension::llm::GenerationConfig config{
    .seq_len = 128,
    .temperature = 0.8f
};
runner->generate("Hello, how are you?", config);
```
```swift
import ExecuTorchLLM

let runner = TextRunner(modelPath: "llama.pte", tokenizerPath: "tiktoken.bin")
try runner.generate("Hello, how are you?", Config {
  $0.sequenceLength = 128
}) { token in
  print(token, terminator: "")
}
```
Kotlin (Android) — API Docs • Demo App
```kotlin
val llmModule = LlmModule("llama.pte", "tiktoken.bin", 0.8f)
llmModule.load()
llmModule.generate("Hello, how are you?", 128, object : LlmCallback {
    override fun onResult(result: String) { print(result) }
    override fun onStats(stats: String) { }
})
```
For multimodal models (vision, audio), use the MultiModal runner API, which extends the LLM runner to handle image and audio inputs alongside text. See the Llava and Voxtral examples.
See examples/models/llama for the complete workflow, including quantization, mobile deployment, and advanced options.
Next Steps:
- 📖 Step-by-step tutorial — Complete walkthrough for your first model
- ⚡ Colab notebook — Try ExecuTorch instantly in your browser
- 🤖 Deploy Llama models — LLM workflow with quantization and mobile demos
| Platform | Supported Backends |
|---|---|
| Android | XNNPACK, Vulkan, Qualcomm, MediaTek, Samsung Exynos |
| iOS | XNNPACK, MPS, CoreML (Neural Engine) |
| Linux / Windows | XNNPACK, OpenVINO, CUDA (experimental) |
| macOS | XNNPACK, MPS, Metal (experimental) |
| Embedded / MCU | XNNPACK, ARM Ethos-U, NXP, Cadence DSP |
See Backend Documentation for detailed hardware requirements and optimization guides.
ExecuTorch powers on-device AI at scale across Meta's family of apps, VR/AR devices, and partner deployments. View success stories →
LLMs: Llama 3.2/3.1/3, Qwen 3, Phi-4-mini, LiquidAI LFM2
Multimodal: Llava (vision-language), Voxtral (audio-language), Gemma (vision-language)
Vision/Speech: MobileNetV2, DeepLabV3, Whisper
Resources: examples/ directory • executorch-examples out-of-tree demos • Optimum-ExecuTorch for HuggingFace models
ExecuTorch provides advanced capabilities for production deployment:
- Quantization — Built-in support via torchao for 8-bit, 4-bit, and dynamic quantization (see the sketch below)
- Memory Planning — Optimize memory usage with ahead-of-time allocation strategies
- Developer Tools — ETDump profiler, ETRecord inspector, and model debugger
- Selective Build — Strip unused operators to minimize binary size
- Custom Operators — Extend with domain-specific kernels
- Dynamic Shapes — Support variable input sizes with bounded ranges
See Advanced Topics for quantization techniques, custom backends, and compiler passes.
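As a concrete example of the quantization support mentioned above, here is a minimal post-training quantization sketch using the PT2E APIs with the XNNPACK quantizer. Exact import paths and the graph-capture call vary across PyTorch/ExecuTorch releases, so treat this as an illustration under those assumptions rather than the canonical recipe.

```python
# Hedged sketch of PT2E post-training quantization targeting the XNNPACK backend.
# Import locations and the capture step differ between releases.
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Capture a quantization-friendly graph (newer PyTorch; older releases used a different capture API).
captured = torch.export.export_for_training(model, example_inputs).module()

quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # calibrate with representative inputs
quantized = convert_pt2e(prepared)

# The quantized module then goes through the same export -> lower -> .pte flow
# shown in the Quick Start.
```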
- Documentation Home — Complete guides and tutorials
- API Reference — Python, C++, Java/Kotlin APIs
- Backend Integration — Build custom hardware backends
- Troubleshooting — Common issues and solutions
We welcome contributions from the community!
- 💬 GitHub Discussions — Ask questions and share ideas
- 🎮 Discord — Chat with the team and community
- 🐛 Issues — Report bugs or request features
- 🤝 Contributing Guide — Guidelines and codebase structure
ExecuTorch is BSD licensed, as found in the LICENSE file.
Part of the PyTorch ecosystem