
Inference Vision Transformer (ViT) in plain C/C++ using ggml without any extra dependencies

Description

This project presents a standalone implementation of the well-known Vision Transformer (ViT) model family, which is used in a broad spectrum of applications and in SOTA models such as Large Multimodal Models (LMMs). The primary goal is to provide a C/C++ inference engine tailored for ViT models, using ggml to enhance performance, particularly on edge devices. Designed to be both lightweight and self-contained, this implementation can run across diverse platforms.

Table of Contents
  1. Description
  2. Features
  3. Vision Transformer Architecture
  4. Quick Example
  5. Convert PyTorch to GGUF
  6. Build
  7. Run
  8. Benchmark against PyTorch
  9. Quantization
  10. To-Do List

Features

  • Dependency-free and lightweight inference thanks to ggml.
  • 4-bit, 5-bit and 8-bit quantization support.
  • Support for timm ViTs with different variants out of the box.

An important aspect of using vit.cpp is its short startup time compared to common DL frameworks, which makes it suitable for serverless deployments where cold start is an issue.

Vision Transformer architecture

The implemented architecture is based on the original Vision Transformer from:

[Figure: Vision Transformer overview]

ViT architecture. Taken from the original paper.
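As a concrete sanity check on the dimensions, using the tiny configuration printed in the quick example below (img_size = 224, patch_size = 16, hidden_size = 192, 3 attention heads, 12 layers, 1000 classes): the input image is split into (224 / 16)² = 196 non-overlapping patches, each patch is linearly projected to a 192-dimensional embedding, and a class token is prepended, giving a sequence of 197 tokens that passes through the 12 transformer encoder layers before the classification head produces the 1000 ImageNet logits.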

Quick example

[Example input image: assets/magpie.jpeg]

Output:
$ ./bin/vit -t 4 -m ../ggml-model-f16.gguf -i ../assets/magpie.jpeg -k 5
main: seed = 1701176263
main: n_threads = 4 / 8
vit_model_load: loading model from '../ggml-model-f16.gguf' - please wait
vit_model_load: hidden_size            = 192
vit_model_load: num_hidden_layers      = 12
vit_model_load: num_attention_heads    = 3
vit_model_load: patch_size             = 16
vit_model_load: img_size               = 224
vit_model_load: num_classes            = 1000
vit_model_load: ftype                  = 1
vit_model_load: qntvr                  = 0
operator(): ggml ctx size =  11.13 MB
vit_model_load: ................... done
vit_model_load: model size =    11.04 MB / num tensors = 152
main: loaded image '../assets/magpie.jpeg' (500 x 470)
vit_image_preprocess: scale = 2.232143
processed, out dims : (224 x 224)

> magpie : 0.87
> goose : 0.02
> toucan : 0.01
> drake : 0.01
> king penguin, Aptenodytes patagonica : 0.01

main: model load time = 17.92 ms
main: processing time = 146.96 ms
main: total time = 164.88 ms

Convert PyTorch to GGUF

# clone the repo recursively
git clone --recurse-submodules https://github.com/staghado/vit.cpp.git
cd vit.cpp

# install torch and timm
pip install torch timm

# list available models if needed; note that not all models are supported
python convert-pth-to-ggml.py --list

# convert the weights to gguf: vit tiny with a patch size of 16 and an image
# size of 384, pre-trained on ImageNet21k and fine-tuned on ImageNet1k
python convert-pth-to-ggml.py --model_name vit_tiny_patch16_384.augreg_in21k_ft_in1k --ftype 1
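If you want to double-check a model name before converting, timm can also be queried directly. A small sketch, assuming the torch and timm install from the step above (the converter's --list flag remains the authority on which of these are actually supported):

# list timm ViT variants matching a wildcard pattern
python -c "import timm; print(timm.list_models('vit_tiny_patch16*'))"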

Note: You can also download the converted weights from Hugging Face directly.

wget https://huggingface.co/staghado/vit.cpp/blob/main/tiny-ggml-model-f16.gguf

Build

Simple build

# build ggml and vit
mkdir build && cd build
cmake .. && make -j4

# run inference
./bin/vit -t 4 -m ../ggml-model-f16.gguf -i ../assets/tench.jpg

The optimal number of threads depends on many factors, and more is not always better. Using a number of threads equal to the number of available physical cores usually gives the best speed.
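On Linux, a quick way to look up the physical core count before choosing -t (a small sketch; the exact lscpu output format varies between systems):

# physical cores = cores per socket x sockets
lscpu | grep -E 'Core\(s\) per socket|Socket\(s\)'

# e.g. with 4 physical cores:
./bin/vit -t 4 -m ../ggml-model-f16.gguf -i ../assets/tench.jpg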

Per-device optimizations

Generate per-device instructions that work best for the given machine rather than using generic CPU instructions. This can be done by specifying -march=native in the compiler flags (a configure-time example is shown after the list below).

  • Multi-threading and vectorization
  • Loop transformations (unrolling)
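A minimal configure-time sketch, assuming the project's CMake build picks up the standard CMAKE_C_FLAGS / CMAKE_CXX_FLAGS variables:

# rebuild with instructions native to the host CPU
cmake .. -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native"
make -j4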

For AMD host processors

You can use a specialized compiler released by AMD to make full use of your specific processor's architecture. Read more here: AMD Optimizing C/C++ and Fortran Compilers (AOCC).

You can follow the given instructions to install the AOCC compiler.

Note: For my AMD Ryzen 7 3700U, the improvements were not very significant, but for more recent processors there could be a gain from using a specialized compiler.

Using OpenMP

Additionally, you can compile with OpenMP by adding the -fopenmp flag to the compiler options in the CMakeLists file, which enables multithreaded runs. Make sure to also enable multiple threads when running, e.g.:

OMP_NUM_THREADS=4 ./bin/vit -t 4 -m ../ggml-model-f16.bin -i ../assets/tench.jpg
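Alternatively, a minimal sketch that injects the flag at configure time instead of editing CMakeLists.txt (assuming the standard CMAKE_C_FLAGS / CMAKE_CXX_FLAGS variables are honored and your compiler supports OpenMP):

cmake .. -DCMAKE_C_FLAGS="-fopenmp" -DCMAKE_CXX_FLAGS="-fopenmp" && make -j4
OMP_NUM_THREADS=4 ./bin/vit -t 4 -m ../ggml-model-f16.bin -i ../assets/tench.jpg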

Run

usage: ./bin/vit [options]

options:
  -h, --help              show this help message and exit
  -s SEED, --seed SEED    RNG seed (default: -1)
  -t N, --threads N       number of threads to use during computation (default: 4)
  -m FNAME, --model FNAME model path (default: ../ggml-model-f16.bin)
  -i FNAME, --inp FNAME   input file (default: ../assets/tench.jpg)
  -k N, --topk N          top k classes to print (default: 5)
  -e FLOAT, --epsilon     epsilon (default: 0.000001)

Benchmark against PyTorch

First experiments on Apple M1 show inference speedups (up to 6x faster for the base model) compared to native PyTorch inference.

ViT inference

You can efficiently run ViT inference on the CPU. The table below reports memory requirements and inference speed on an AMD Ryzen 7 3700U (4 cores, 8 threads) for both native PyTorch and vit.cpp. Using 4 threads gives the best results on this machine. The reported inference speeds are averages over 10 runs for both PyTorch and vit.cpp.

Model | Max Mem (PyTorch) | Max Mem (vit.cpp) | Speed (PyTorch) | Speed (vit.cpp)
tiny  | ~780 MB  | ~20 MB  | 431 ms  | 120 ms
small | ~965 MB  | ~52 MB  | 780 ms  | 463 ms
base  | ~1.61 GB | ~179 MB | 2393 ms | 1441 ms
large | ~3.86 GB | ~597 MB | 8151 ms | 4892 ms

Note: The models used are of the form vit_{size}_patch16_224.augreg_in21k_ft_in1k.

Benchmark on your machine

In order to test the inference speed on your machine, you can run the following scripts:

chmod +x scripts/benchmark.*

# install memory_profiler & threadpoolctl
pip install memory_profiler threadpoolctl

# run the PyTorch benchmark
python scripts/benchmark.py

# run the vit.cpp benchmark for the non-quantized model
./scripts/benchmark.sh

# run the benchmark for quantized models: 4 threads and the quantize flag
./scripts/benchmark.sh 4 1

Both scripts use 4 threads by default. In Python, the threadpoolctl library is used to limit the number of threads used by PyTorch.

Quantization

vit.cpp supports many quantization strategies from ggml, such as the q4_0, q4_1, q5_0, q5_1 and q8_0 types. You can quantize a model in F32 (the patch embedding is in F16) to one of these types using the ./bin/quantize binary.

usage: ./bin/quantize /path/to/ggml-model-f32.gguf /path/to/ggml-model-quantized.gguf type
  type = 2 - q4_0
  type = 3 - q4_1
  type = 6 - q5_0
  type = 7 - q5_1
  type = 8 - q8_0

For example, you can run the following to convert the model to q5_1:

./bin/quantize ../tiny-ggml-model-f16.gguf ../tiny-ggml-model-f16-quant.gguf 7

Then you can use tiny-ggml-model-f16-quant.gguf just like the model in F16.
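For example, with the same flags as the F16 model in the quick example:

./bin/vit -t 4 -m ../tiny-ggml-model-f16-quant.gguf -i ../assets/magpie.jpeg -k 5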

Results

Here are the benchmarks for the different models and quantization types on my machine. For an accurate estimate of run times, each benchmark was run 100 times.

Model | Quantization | Speed (ms) | Mem (MB)
tiny  | q4_0 | 105 ms  | 12 MB
tiny  | q4_1 | 97 ms   | 12 MB
tiny  | q5_0 | 116 ms  | 13 MB
tiny  | q5_1 | 112 ms  | 13 MB
tiny  | q8_0 | 90 ms   | 15 MB
small | q4_0 | 240 ms  | 23 MB
small | q4_1 | 224 ms  | 24 MB
small | q5_0 | 288 ms  | 25 MB
small | q5_1 | 277 ms  | 27 MB
small | q8_0 | 228 ms  | 33 MB
base  | q4_0 | 704 ms  | 61 MB
base  | q4_1 | 626 ms  | 66 MB
base  | q5_0 | 851 ms  | 71 MB
base  | q5_1 | 806 ms  | 76 MB
base  | q8_0 | 659 ms  | 102 MB
large | q4_0 | 2189 ms | 181 MB
large | q4_1 | 1919 ms | 199 MB
large | q5_0 | 2676 ms | 217 MB
large | q5_1 | 2547 ms | 235 MB
large | q8_0 | 1994 ms | 325 MB

To-Do List

  • Evaluate performance on ImageNet1k:

    Run evaluation on the ImageNet1k test set and analyze the performance of the different quantization schemes.

This project was highly inspired by the following projects:

Star History

[Star History chart]
