tlkh/tf-metal-experiments
# TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)

## Setup

This is tested on M1-series Apple Silicon SoCs only.

### TensorFlow 2.x

1. Follow the official instructions from Apple for installing the TensorFlow Metal plugin.
2. Test that your Metal GPU is working by running `tf.config.list_physical_devices("GPU")`; you should see 1 GPU present (it is not named). Later, when you actually use the GPU, a more informative line such as `Metal device set to: Apple M1 Max` will be printed.
3. Now you should be ready to run any TF code that doesn't require external libraries.
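Steps 2 and 3 above can be scripted as a quick sanity check. This is a minimal sketch (the helper `count_gpus` and the 4x4 matmul are illustrative, not part of the repo's scripts):

```python
def count_gpus(devices):
    """Count GPU entries in a list of physical-device objects or names."""
    return sum(1 for d in devices if "GPU" in str(d))

def check_metal_gpu():
    """Run on a machine with tensorflow-macos + tensorflow-metal installed."""
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    print(f"Found {count_gpus(gpus)} GPU(s)")
    # Running an actual op forces device placement; with tensorflow-metal
    # this is when the "Metal device set to: Apple M1 Max" line appears.
    x = tf.random.normal((4, 4))
    print(tf.matmul(x, x).shape)

# check_metal_gpu()  # uncomment on an Apple Silicon machine
```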

### HuggingFace Transformers library

If you want to play around with Transformer models (with the TF Metal backend, of course), you will need to install the HuggingFace Transformers library.

1. Install the `regex` library (I don't know why it has to be like this, but yeah): `python3 -m pip install --upgrade regex --no-use-pep517`. You might need to do `xcode-select --install` if the above command doesn't work.
2. `pip install transformers ipywidgets`
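Once installed, a short smoke test confirms the library works with the TF backend. This is a sketch; the hidden-size table below is an assumption based on the standard HuggingFace configs for the two checkpoints benchmarked later:

```python
# Expected hidden sizes for the benchmarked checkpoints (assumption: the
# standard HuggingFace configs for these models)
HIDDEN_SIZES = {"distilbert-base-uncased": 768, "bert-large-uncased": 1024}

def smoke_test(name="distilbert-base-uncased"):
    """Run on a machine with transformers + the TF Metal backend installed."""
    from transformers import AutoTokenizer, TFAutoModel
    tok = AutoTokenizer.from_pretrained(name)
    model = TFAutoModel.from_pretrained(name)
    out = model(**tok("hello metal", return_tensors="tf"))
    # last_hidden_state has shape (batch, seq_len, hidden_size)
    assert out.last_hidden_state.shape[-1] == HIDDEN_SIZES[name]
    print("TF backend OK:", out.last_hidden_state.shape)

# smoke_test()  # uncomment once transformers is installed
```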

## Experiments and Benchmarks

After some trial and error, here are some initial benchmarks for what should be approximately the best capability of the M1 Max.

* For all the cases here, increasing the batch size does not seem to increase throughput.
* High Power Mode enabled + plugged into the charger (this does not seem to affect the benchmarks anyway).

Power draw also doesn't seem to be able to go much higher than ~40W:

* Power draw from the GPU (averaged over 1 second) can be measured with `sudo powermetrics --samplers gpu_power -i1000 -n1`.
* I decided to report peak power as observed via `asitop` (see: `tlkh/asitop`).
| Model       | GPU        | BatchSize | Throughput  | Peak Power | Memory |
| ----------- | ---------- | --------- | ----------- | ---------- | ------ |
| ResNet50    | M1 Max 32c | 128       | 140 img/sec | 42W        | 21 GB  |
| MobileNetV2 | M1 Max 32c | 128       | 352 img/sec | 37W        | 13 GB  |
| DistilBERT  | M1 Max 32c | 64        | 120 seq/sec | 35W        | 9 GB   |
| BERTLarge   | M1 Max 32c | 16        | 19 seq/sec  | 36W        | 14 GB  |

The benchmark scripts used are included in this repo.

```shell
python train_benchmark.py --type cnn --model resnet50
python train_benchmark.py --type cnn --model mobilenetv2
python train_benchmark.py --type transformer --model distilbert-base-uncased
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16
```

## Reference Benchmarks from RTX 3090

| Model       | GPU  | BatchSize | Throughput   | Power |
| ----------- | ---- | --------- | ------------ | ----- |
| **Same Batch Size as M1** | | |            |       |
| ResNet50    | 3090 | 128       | 1100 img/sec | 360W  |
| MobileNetV2 | 3090 | 128       | 2001 img/sec | 340W  |
| DistilBERT  | 3090 | 64        | 1065 seq/sec | 360W  |
| BERTLarge   | 3090 | 16        | 131 seq/sec  | 335W  |
| **Larger Batch Size**     | | |            |       |
| ResNet50    | 3090 | 256       | 1185 img/sec | 370W  |
| MobileNetV2 | 3090 | 256       | 2197 img/sec | 350W  |
| DistilBERT  | 3090 | 256       | 1340 seq/sec | 380W  |
| BERTLarge   | 3090 | 64        | 193 seq/sec  | 365W  |

For the 3090, the same script is used, but with additional optimizations that leverage hardware (Tensor Cores) and software (the XLA compiler) not present/working on the M1. The length of an epoch is also increased, since the 3090 is sometimes so fast that training finishes in seconds, and the overhead of starting/ending the run leads to poorer measurements.

Note: the 3090 is running at a 400W power limit. The CPU is a 5600X.

```shell
# config for NVIDIA Tensor Core GPU
# run with more steps, XLA and FP16 (enable tensor core aka mixed precision)
python train_benchmark.py --type cnn --model resnet50 --xla --fp16 --steps 100
python train_benchmark.py --type cnn --model mobilenetv2 --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model distilbert-base-uncased --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16 --xla --fp16 --steps 30
# If no Tensor Core, remove --fp16 flag
```
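The `--xla` and `--fp16` flags correspond to standard TF settings along these lines (a sketch of the typical API calls, not the repo's exact code; the actual logic lives in `train_benchmark.py`):

```python
def configure_for_tensor_cores():
    """Sketch of the settings behind the --xla and --fp16 flags."""
    import tensorflow as tf
    # --xla: JIT-compile graphs with the XLA compiler
    tf.config.optimizer.set_jit(True)
    # --fp16: mixed precision keeps float32 variables but computes in
    # float16, which is what lets matmuls/convs run on Tensor Cores
    tf.keras.mixed_precision.set_global_policy("mixed_float16")
```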

## Measuring Achievable TFLOPS

We can use TF to write a matrix multiplication benchmark to estimate the maximum compute performance we can get out of an M1 Max. It seems we can get upwards of 8 TFLOPS for large enough problem sizes.

The plot can be generated using `tflops_sweep.py`.

Note that FP64 and FP16 performance appears to be non-existent (the code automatically falls back to running on the CPU if FP64 or FP16 is specified as the data type).
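The sweep can be sketched roughly as follows. This is a simplified version, not the repo's `tflops_sweep.py`; the matrix sizes and iteration count are illustrative. The FLOP count uses the standard convention that an n×n by n×n matmul performs 2n³ floating-point operations (n³ multiply-adds):

```python
import time

def matmul_tflops(n, seconds):
    """TFLOPS for one n x n matmul: 2*n**3 FLOPs / elapsed seconds / 1e12."""
    return 2 * n**3 / seconds / 1e12

def sweep(sizes=(2048, 4096, 8192), iters=10):
    """Run on a machine with tensorflow-metal; sizes/iters are illustrative."""
    import tensorflow as tf
    for n in sizes:
        a = tf.random.normal((n, n))
        b = tf.random.normal((n, n))
        tf.matmul(a, b).numpy()  # warm-up run, triggers kernel compilation
        start = time.time()
        for _ in range(iters):
            c = tf.matmul(a, b)
        c.numpy()  # force execution to complete before stopping the clock
        elapsed = (time.time() - start) / iters
        print(f"n={n}: {matmul_tflops(n, elapsed):.2f} TFLOPS")

# sweep()  # uncomment on an Apple Silicon machine
```

The `.numpy()` call matters: TF dispatches ops asynchronously, so without forcing the result back to the host the timer would stop before the GPU finishes.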
