Core ML support #566

Merged
ggerganov merged 5 commits into master from coreml
Apr 15, 2023
Conversation

@ggerganov (Member) commented Mar 5, 2023 (edited)

Running Whisper inference on Apple Neural Engine (ANE) via Core ML

This PR extends whisper.cpp to run the Whisper Encoder on the ANE through Core ML inference.
The performance gain is more than 3x compared to 8-thread CPU for the tiny, base and small models.

Here are initial performance benchmarks for the Encoder on M1 Pro with (top) and without (bottom) Core ML:

| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit |
|---|---|---|---|---|---|---|---|
| MacBook M1 Pro | MacOS 13.2.1 | CORE ML | tiny | 4 | 50 | 30 | b0ac915 |
| MacBook M1 Pro | MacOS 13.2.1 | CORE ML | base | 4 | 74 | 64 | b0ac915 |
| MacBook M1 Pro | MacOS 13.2.1 | CORE ML | small | 4 | 188 | 208 | b0ac915 |
| MacBook M1 Pro | MacOS 13.2.1 | CORE ML | medium | 4 | 533 | 1033 | b0ac915 |
| MacBook M1 Pro | MacOS 13.2.1 | CORE ML | large | 4 | ? | ? | b0ac915 |

| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit |
|---|---|---|---|---|---|---|---|
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 8 | 71 | 102 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 8 | 96 | 220 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 8 | 233 | 685 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 8 | 603 | 1928 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 8 | 1158 | 3350 | 206fc93 |

This PR adds a helper script models/generate-coreml-model.sh that can be used to easily generate a Core ML Encoder model yourself. For now, I don't plan on hosting the Core ML models as there is some chance that the implementation will change in the future. Therefore, it is recommended that everyone simply generate them locally with that script. See the instructions below.
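For the curious, here is a rough sketch of what the helper script wraps. This is an assumption pieced together from the converter script and the xcrun/coremlc references later in this thread; the actual file names and arguments used by models/generate-coreml-model.sh may differ:

    # Sketch only: convert the PyTorch encoder to a Core ML package, then compile it into the
    # .mlmodelc folder that whisper.cpp loads at runtime. The package path is illustrative.
    python3 models/convert-whisper-to-coreml.py --model base.en
    xcrun coremlc compile <generated-encoder-package>.mlpackage models/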

There are a couple of drawbacks:

  • The first time a Core ML model runs on a device takes a long time (several seconds, depending on the model).
    All follow-up runs are fast (see the warm-up sketch below).
  • The medium and large models take a long time to convert to Core ML (tens of minutes) and require a lot of RAM. The first run on a device is also very slow for them, so I'm not sure they are viable for production use.
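A minimal warm-up sketch (assuming the base.en model and the bundled jfk.wav sample): run the model once ahead of time so that the one-time on-device Core ML compilation does not hit the first real transcription.

    ./main -m models/ggml-base.en.bin -f samples/jfk.wav > /dev/null   # first run: slow (on-device Core ML compilation)
    ./main -m models/ggml-base.en.bin -f samples/jfk.wav               # subsequent runs: fast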

Acknowledgements

Huge thanks to @wangchou for the initial demonstration of how to use Core ML in whisper.cpp (#548)

Thanks to @RobertRiachi for optimizing for ANE execution and improving the model export process

Thanks to everyone else who participated in #548 and helped with insights, testing and ideas

Usage

  • Install dependencies:

    pip install ane_transformers
    pip install openai-whisper
    pip install coremltools
  • Generate a Core ML model. For example, to generate a base.en model, use:

    ./models/generate-coreml-model.sh base.en

    This will generate the folder models/ggml-base.en-encoder.mlmodelc

  • Build whisper.cpp with Core ML support:

    # using Makefile
    make clean
    WHISPER_COREML=1 make -j

    # using CMake
    cd build
    cmake -DWHISPER_COREML=1 ..
  • Run the examples as usual. For example:

    ./main -m models/ggml-base.en.bin -f samples/gb0.wav
    ...
    whisper_init_state: loading Core ML model from 'models/ggml-base.en-encoder.mlmodelc'
    whisper_init_state: first run on a device may take a while ...
    whisper_init_state: Core ML model loaded
    system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |
    ...

    The first run on a device is slow, since the ANE service compiles the Core ML model to a device-specific format.
    Subsequent runs are faster (see the consolidated sketch below).
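Putting the steps above together, a minimal end-to-end sketch (assuming the pip dependencies and the Xcode command line tools are already installed; the final grep only confirms that the Core ML path shows up in the init / system_info logs):

    ./models/download-ggml-model.sh base.en     # regular ggml model (encoder + decoder)
    ./models/generate-coreml-model.sh base.en   # Core ML encoder -> models/ggml-base.en-encoder.mlmodelc
    make clean
    WHISPER_COREML=1 make -j
    ./main -m models/ggml-base.en.bin -f samples/jfk.wav 2>&1 | grep -E "COREML|Core ML"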

TODO

  • Can the Decoder be ported to ANE too? Run encoder on Apple Neural Engine #548 (reply in thread)
    Answer: Yes, but it is slow
  • Convert the medium and large models to Core ML format and upload to HF
    Needs an Apple Silicon Mac with 64GB RAM to do the conversion from PyTorch -> Core ML
    Does not seem viable - too slow
  • Unified ggml + coreml model file
    We currently load both the full ggml model (encoder + decoder) and the coreml encoder - not optimal
    Will be done in the future, hopefully via community contributions
  • Scripts for generating Core ML model files (e.g. https://github.com/wangchou/callCoreMLFromCpp)
  • Support loading the Core ML model from a memory buffer
    Currently we support only loading from a folder on disk
    Low-prio, hoping for contributions
  • Progress report for initial-run model processing
    Does not look possible. Any Core ML experts?
  • Adjust memory usage buffers when using Core ML
    Not needed - the Encoder compute buffer is less than 20MB even for the large model
  • Try to avoid the first on-device automatic model generation (it takes a long time)
    Does not look possible. Any Core ML experts?
  • The medium model takes more than 30 minutes to convert on the first run. Is there a work-around?
    I think not
  • Can we run the Core ML inference on the GPU?
    Looks like it's not worth it

Future work

  • Fix the ANE-optimized Whisper implementation. Currently, there is something wrong with the tensor shapes when passed from / to whisper.cpp and the transcription gets corrupted. The optimized version should be about 1.5x faster than the original one.
  • Add support for decoder-only ggml models. This will avoid having to store the Encoder data twice on disk / in memory. Currently, it is stored once in the ggml model and again in the Core ML model. Removing the duplication will reduce both disk and memory usage.
  • Add support for running the Decoder on the ANE. Due to the nature of the Decoder operations, running them on the CPU is generally faster than running them on the ANE. However, an ANE Decoder should be much more energy-efficient than the CPU one, so having this option could be useful in some cases.

@brozkrut

Great work!

I tested the coreml branch on a Mac Mini M2 (base $599 model).

The performance gain seems to be more than 5x compared to 4-thread CPU (thanks to the much faster ANE on the M2; on the base Mac Mini M2, the 8-thread CPU run is slower than the 4-thread one).

Performance benchmarks for the Encoder with (top) and without (bottom) Core ML:

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
|---|---|---|---|---|---|---|---|
| Mac Mini M2 | macOS 13.2.1 | CORE ML | tiny | 4 | 44 | 25 | 17a1459 |
| Mac Mini M2 | macOS 13.2.1 | CORE ML | base | 4 | 66 | 54 | 17a1459 |
| Mac Mini M2 | macOS 13.2.1 | CORE ML | small | 4 | 163 | 190 | 17a1459 |
| Mac Mini M2 | macOS 13.2.1 | CORE ML | medium | 4 | | | 17a1459 |
| Mac Mini M2 | macOS 13.2.1 | CORE ML | large | 4 | | | 17a1459 |

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
|---|---|---|---|---|---|---|---|
| Mac Mini M2 | macOS 13.2.1 | NEON BLAS | tiny | 4 | 40 | 142 | 59fdcd1 |
| Mac Mini M2 | macOS 13.2.1 | NEON BLAS | base | 4 | 67 | 299 | 59fdcd1 |
| Mac Mini M2 | macOS 13.2.1 | NEON BLAS | small | 4 | 152 | 980 | 59fdcd1 |
| Mac Mini M2 | macOS 13.2.1 | NEON BLAS | medium | 4 | | | 59fdcd1 |
| Mac Mini M2 | macOS 13.2.1 | NEON BLAS | large | 4 | | | 59fdcd1 |

@DontEatOreo

I compiled whisper.cpp with Core ML support using make, and I also built the mlmodel, but I'm getting an error:

whisper_init_from_file: loading model from 'models/ggml-base.en.mlmodelc'
whisper_model_load: loading model
whisper_model_load: invalid model data (bad magic)
whisper_init: failed to load model
error: failed to initialize whisper context

Is there anything else I'm missing? 🤔

@ggerganov (Member, Author)

@DontEatOreo

On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin.
The code will automatically also load models/ggml-base.en.mlmodelc if it is present in the same folder.
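In other words, a sketch of the expected layout (the Core ML folder name here follows the generate script from the PR description; the exact suffix has changed between revisions of this branch):

    # models/ggml-base.en.bin                <- pass this path to ./main
    # models/ggml-base.en-encoder.mlmodelc   <- picked up automatically if present next to the .bin
    ./main -m models/ggml-base.en.bin -f samples/jfk.wav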

@DontEatOreo

@ggerganov Thank you! I was very confused about why it wasn't working even though I did everything right.

@dennislysenko

This is great. Excited to see how this feature develops. Leveraging the ANE would be huge, even more so if the decoder could be ported to it as well.

@strangelearning

Just saw this was announced. Is it useful? https://github.com/apple/ml-ane-transformers

@cerupcat

> @DontEatOreo
>
> On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin. The code will automatically also load models/ggml-base.en.mlmodelc if it is present in the same folder.

Does this mean we have to bundle both files with the app? Asking since the file size gets fairly large having to include them all.

@lucabeetz

Hey, thanks for this awesome project! I am trying to run the whisper.objc example with CoreML but running into some issues. Has someone successfully done this and could guide me on how to set it up?

@ggerganov (Member, Author)

> @DontEatOreo
> On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin. The code will automatically also load models/ggml-base.en.mlmodelc if it is present in the same folder.
>
> Does this mean we have to bundle both files with the app? Asking since the file size gets fairly large having to include them all.

The solution is to produce an encoder-only Core ML model in one file and a decoder-only standard model in another file. This is not very difficult to achieve, but supporting so many model files might become too much for me to maintain. So I will probably rely on someone helping out and demonstrating how this can be done, either as an example in this repo or in a fork.

@ggerganov marked this pull request as ready for review April 14, 2023 19:27

@ggerganov (Member, Author)

This is almost ready to merge. I am hoping to do it tomorrow.

The most important part that currently needs testing is the creation of the CoreML models, following the instructions here:

#548 (reply in thread)

If you give this a try, please let us know the results and whether you encountered any issues.
Also, let us know whether you used quantized or non-quantized Core ML models and what the experience has been.

I believe the tiny, base and small models should be supported, while medium and large do not seem viable for this approach.

@aehlke

1.4 GB for medium sounds fine for users, but you're saying there are other limitations with it?

@ggerganov (Member, Author)

@aehlke The scripts for generating Core ML models support all sizes, but on my M1 Pro it takes a very long time (i.e. more than half an hour) to generate the medium model. After that, the first run is also very slow. Subsequent runs are about 2 times faster compared to CPU-only.

In any case, you can follow the instructions in this PR and see how it works on your device.


@ggerganov merged commit 5e47e22 into master Apr 15, 2023
@ggerganov deleted the coreml branch April 15, 2023 10:21
@neurostar commented Apr 15, 2023 (edited)

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
|---|---|---|---|---|---|---|---|
| MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | tiny | 4 | 41 | 31 | f19e23f |
| MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | base | 4 | 59 | 57 | f19e23f |
| MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | small | 4 | 147 | 195 | f19e23f |
| MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | medium | 4 | 576 | 783 | f19e23f |
| MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | large | 4 | 1196 | 2551 | f19e23f |

Great work!
Converting the large model to a Core ML model consumed ~9.7 GB of memory (short peak of 15.03 GB); it still worked fine on the 8 GB Air.

Edit:
I measured the Core ML model conversion time and the first-load conversion time (second - first).

| Model | Core ML conv | First-load conv (sec) |
|---|---|---|
| tiny | 4.915 | 0.72 |
| base | 8.564 | 1.34 |
| small | 26.050 | 4.72 |
| medium | 1:35.85 | 15.57 |
| large | 3:43.32 | 35.10 |
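For anyone who wants to reproduce the conversion column, a simple sketch (assuming the tiny ggml model is already downloaded; the exact time output format depends on your shell, and the first ./main run afterwards then pays the additional first-load cost):

    time ./models/generate-coreml-model.sh tiny              # Core ML conversion time
    time ./main -m models/ggml-tiny.bin -f samples/jfk.wav   # includes the one-time first-load compilation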
@CarberryChai

When running this script:

./models/generate-coreml-model.sh base.en

I got the error:

xcrun: error: unable to find utility "coremlc", not a developer tool or in PATH
@flexchar

Is it just me, or is the link to the Core ML models missing from Hugging Face?

Btw, @ggerganov, if you need help converting the models, I'd be glad to contribute. It seems to me that it only needs to be done once. :)

@ggerganov (Member, Author)

For now, you should generate the Core ML models locally following the instructions.
I don't want to host them on HF yet, because it is very likely that the models will change soon - there are some pending improvements (see #548 (reply in thread)). If I upload them now, later we will get new models and everyone will be confused about which model they are using, etc.

@flexchar

In that regard, I'd like to ask for help, since I can't seem to get it to work.

python3.10 ./models/convert-whisper-to-coreml.py --model tiny

100%|██████████| 72.1M/72.1M [00:05<00:00, 14.3MiB/s]
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=384, n_audio_head=6, n_audio_layer=4, n_vocab=51865, n_text_ctx=448, n_text_state=384, n_text_head=6, n_text_layer=4)
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100%|█████████▋| 367/368 [00:00<00:00, 6681.50 ops/s]
Running MIL frontend_pytorch pipeline: 100%|██████████| 5/5 [00:00<00:00, 1047.63 passes/s]
Running MIL default pipeline: 100%|██████████| 57/57 [00:00<00:00, 147.77 passes/s]
Running MIL backend_mlprogram pipeline: 100%|██████████| 10/10 [00:00<00:00, 2599.51 passes/s]
Traceback (most recent call last):
  File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 331, in <module>
    decoder = convert_decoder(hparams, decoder, quantize=args.quantize)
  File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 283, in convert_decoder
    traced_model = torch.jit.trace(model, (token_data, audio_data))
  File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 741, in trace
    return trace_module(
  File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 958, in trace_module
    module._c._create_method_from_trace(
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 211, in forward
    x = block(x, xa, mask=self.mask, kv_cache=kv_cache)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 138, in forward
    x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0]
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 83, in forward
    k = self.key(x if xa is None else xa)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 37, in forward
    return F.linear(
RuntimeError: mat1 and mat2 shapes cannot be multiplied (384x1500 and 384x384)

@ganqqwerty

This gets stuck forever on an M1 with 64 GB. I waited for 12 hours but still got no more messages. macOS 13.5 (22G74).

whisper_init_state: loading Core ML model from 'models/ggml-large-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...

@artemgordinskiy commented Sep 2, 2023 (edited)

I finally managed to get it to work on the "beta" (v1.4.2), with the same HW and OS as @ganqqwerty:

  1. Built with WHISPER_COREML=1 make -j.
  2. Downloaded the large CoreML model from Huggingface.
  3. Ran a sample overnight (~11 hours):

~/D/whisper.cpp ❯❯❯ ./main -m models/ggml-large.bin -f samples/jfk.wav    [20:34:38]
whisper_init_from_file_no_state: loading model from 'models/ggml-large.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5
whisper_model_load: mem required  = 3557.00 MB (+   71.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     = 2951.27 MB
whisper_model_load: model size    = 2950.66 MB
whisper_init_state: kv self size  =   70.00 MB
whisper_init_state: kv cross size =  234.38 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings:     load time =   985.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    35.84 ms
whisper_print_timings:   sample time =    11.56 ms /    27 runs (    0.43 ms per run)
whisper_print_timings:   encode time =  3036.61 ms /     1 runs ( 3036.61 ms per run)
whisper_print_timings:   decode time =   794.28 ms /    27 runs (   29.42 ms per run)
whisper_print_timings:    total time = 40924196.00 ms

  4. And the consecutive runs go much faster now, with the model loading in just a few seconds:

whisper_print_timings:     load time =  1141.81 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    35.75 ms
whisper_print_timings:   sample time =    11.45 ms /    27 runs (    0.42 ms per run)
whisper_print_timings:   encode time =  3596.32 ms /     1 runs ( 3596.32 ms per run)
whisper_print_timings:   decode time =   825.67 ms /    27 runs (   30.58 ms per run)
whisper_print_timings:    total time =  6655.50 ms

Does anyone know what happened during those 11 hours and why it runs faster now? If the model got "compiled" or whatever, can't I just upload it for other people to use? I don't see any changes to the model files since I downloaded them 🤔

@cust0mphase

> I finally managed to get it to work on the "beta" (v1.4.2), with the same HW and OS as @ganqqwerty: [...]
>
> Does anyone know what happened during those 11 hours and why it runs faster now? If the model got "compiled" or whatever, can't I just upload it for other people to use? I don't see any changes to the model files since I downloaded them 🤔

Can you upload it, please?

@artemgordinskiy

@cust0mphase Upload what? The CoreML model link is in my comment above, and as far as I can see, the files have not changed since I downloaded them.

@ganqqwerty commented Sep 3, 2023 (edited)

I confirm that it works well with the model from Hugging Face (of course, I use large). The performance boost in Ventura 13.5 (22G74) is not that big, maybe 20%, but it's definitely faster. Can't wait for the new OS to come out.

@dhwkdjwndjwjjn

ANE

Hi, I have a question. I was able to run the Core ML models perfectly on my MacBook Pro M1 Pro. However, when I look at the CPU/GPU/ANE usage through powermetrics while transcribing through Core ML models, I noticed the ANE usage is 0% throughout the transcription and GPU use is 100%. So how do we actually make Core ML run on ANE?

Also, I can confirm that macOS Sonoma 14.0 Beta did a much faster job of converting to a Core ML model; I was able to convert the large model in under an hour, while on macOS 13 my conversion of the large model would get stuck overnight and never finish.

Last question: can we, and if so how can we, run the real-time transcription ./stream with the Core ML model? I was only able to run ./stream with the normal model.

Thanks, and great work by the author(s) of whisper.cpp!

@dhwkdjwndjwjjn

> ANE
>
> Hi, I have a question. I was able to run the Core ML models perfectly on my MacBook Pro M1 Pro. However, when I look at the CPU/GPU/ANE usage through powermetrics while transcribing through Core ML models, I noticed the ANE usage is 0% throughout the transcription and GPU use is 100%. So how do we actually make Core ML run on ANE? [...]

OK, I just found out how to do it from others' discussion...

[Screenshot 2023-09-18 at 12:41:36 PM]

You can set it in the file coreml/whisper-encoder.mm.

And as for running Core ML with ./stream, you just need to run:

make clean
WHISPER_COREML=1 make stream -j

and then you can just run ./stream normally and the Core ML model will be loaded.
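For completeness, a sketch of the resulting invocation (the -m / -t flags are the usual stream options; the Core ML encoder is picked up automatically if it sits next to the ggml model, exactly as with ./main):

    ./stream -m models/ggml-base.en.bin -t 4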

@dreampuf

FYI: comparing CPU+GPU vs. CPU+ANE:

# CPU + GPU
whisper_print_timings:     load time =   185.77 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   729.95 ms
whisper_print_timings:   sample time =  3544.57 ms /  8631 runs (    0.41 ms per run)
whisper_print_timings:   encode time =  8853.00 ms /    49 runs (  180.67 ms per run)
whisper_print_timings:   decode time = 50679.41 ms /  8576 runs (    5.91 ms per run)
whisper_print_timings:   prompt time =  1938.64 ms /    52 runs (   37.28 ms per run)
whisper_print_timings:    total time = 66302.43 ms

## second time
whisper_print_timings:     load time =   306.99 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   666.95 ms
whisper_print_timings:   sample time =  3934.44 ms /  8631 runs (    0.46 ms per run)
whisper_print_timings:   encode time =  7717.25 ms /    49 runs (  157.49 ms per run)
whisper_print_timings:   decode time = 51892.14 ms /  8576 runs (    6.05 ms per run)
whisper_print_timings:   prompt time =  1951.12 ms /    52 runs (   37.52 ms per run)
whisper_print_timings:    total time = 67378.17 ms

# CPU + ANE
whisper_print_timings:     load time =   426.37 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   655.52 ms
whisper_print_timings:   sample time =  4105.80 ms /  9129 runs (    0.45 ms per run)
whisper_print_timings:   encode time = 10249.34 ms /    48 runs (  213.53 ms per run)
whisper_print_timings:   decode time = 55378.71 ms /  9073 runs (    6.10 ms per run)
whisper_print_timings:   prompt time =  1981.35 ms /    52 runs (   38.10 ms per run)
whisper_print_timings:    total time = 73484.55 ms

# CPU + ALL
whisper_print_timings:     load time =   328.41 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   699.48 ms
whisper_print_timings:   sample time =  4050.11 ms /  9129 runs (    0.44 ms per run)
whisper_print_timings:   encode time = 10222.64 ms /    48 runs (  212.97 ms per run)
whisper_print_timings:   decode time = 54836.89 ms /  9073 runs (    6.04 ms per run)
whisper_print_timings:   prompt time =  1984.60 ms /    52 runs (   38.17 ms per run)
whisper_print_timings:    total time = 72802.16 ms
@astrowonk commented Oct 1, 2023 (edited)

I don't have precise before/after numbers, but CoreML Whisper sure seems a lot faster on Sonoma. Not just the "first run on a device may take a while …" step which is almost instant now, but the actual encoding seems better?

Maybe this is something improved in the latest versions of whisper.cpp itself, but it runs at close to 100% GPU usage now, which I don't remember always being the case. It's ~5x faster than realtime with the medium.en model on my lowly regular M1.

@dreampuf

Here is an update after Sonoma.

# CPU + GPU
whisper_print_timings:     load time =   298.31 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   687.01 ms
whisper_print_timings:   sample time =  3626.06 ms /  8863 runs (    0.41 ms per run)
whisper_print_timings:   encode time =  9034.63 ms /    48 runs (  188.22 ms per run)
whisper_print_timings:   decode time = 52123.91 ms /  8810 runs (    5.92 ms per run)
whisper_print_timings:   prompt time =  1883.27 ms /    51 runs (   36.93 ms per run)
whisper_print_timings:    total time = 69305.77 ms
ggml_metal_free: deallocating

# 2nd round
whisper_print_timings:     load time =   220.71 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   659.20 ms
whisper_print_timings:   sample time =  3607.61 ms /  8863 runs (    0.41 ms per run)
whisper_print_timings:   encode time =  7268.91 ms /    48 runs (  151.44 ms per run)
whisper_print_timings:   decode time = 52101.25 ms /  8810 runs (    5.91 ms per run)
whisper_print_timings:   prompt time =  1880.41 ms /    51 runs (   36.87 ms per run)
whisper_print_timings:    total time = 66078.09 ms

# CPU + ANE
whisper_print_timings:     load time =   290.60 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   674.62 ms
whisper_print_timings:   sample time =  3722.67 ms /  9019 runs (    0.41 ms per run)
whisper_print_timings:   encode time = 10463.12 ms /    48 runs (  217.98 ms per run)
whisper_print_timings:   decode time = 52677.20 ms /  8963 runs (    5.88 ms per run)
whisper_print_timings:   prompt time =  1935.95 ms /    52 runs (   37.23 ms per run)
whisper_print_timings:    total time = 105001.48 ms

# 2nd round
whisper_print_timings:     load time =   218.93 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   647.12 ms
whisper_print_timings:   sample time =  3874.24 ms /  9019 runs (    0.43 ms per run)
whisper_print_timings:   encode time = 10568.01 ms /    48 runs (  220.17 ms per run)
whisper_print_timings:   decode time = 53258.39 ms /  8963 runs (    5.94 ms per run)
whisper_print_timings:   prompt time =  1956.66 ms /    52 runs (   37.63 ms per run)
whisper_print_timings:    total time = 70788.73 ms

# CPU + ANE + GPU
whisper_print_timings:     load time =   203.14 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   679.72 ms
whisper_print_timings:   sample time =  3868.27 ms /  9019 runs (    0.43 ms per run)
whisper_print_timings:   encode time = 10651.40 ms /    48 runs (  221.90 ms per run)
whisper_print_timings:   decode time = 53248.52 ms /  8963 runs (    5.94 ms per run)
whisper_print_timings:   prompt time =  1942.67 ms /    52 runs (   37.36 ms per run)
whisper_print_timings:    total time = 105808.82 ms

# 2nd round
whisper_print_timings:     load time =   223.98 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   650.97 ms
whisper_print_timings:   sample time =  3727.37 ms /  9019 runs (    0.41 ms per run)
whisper_print_timings:   encode time = 10526.05 ms /    48 runs (  219.29 ms per run)
whisper_print_timings:   decode time = 53171.40 ms /  8963 runs (    5.93 ms per run)
whisper_print_timings:   prompt time =  1950.87 ms /    52 runs (   37.52 ms per run)
whisper_print_timings:    total time = 70573.20 ms

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023

* coreml : use Core ML encoder inference
* coreml : simlpify whisper_encode + log messages
* whisper : resolve rebase conflicts
* coreml : add scripts for CoreML model generation
* bench-all : recognize COREML flag

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023

* coreml : use Core ML encoder inference
* coreml : simlpify whisper_encode + log messages
* whisper : resolve rebase conflicts
* coreml : add scripts for CoreML model generation
* bench-all : recognize COREML flag

landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023

* coreml : use Core ML encoder inference
* coreml : simlpify whisper_encode + log messages
* whisper : resolve rebase conflicts
* coreml : add scripts for CoreML model generation
* bench-all : recognize COREML flag

@helins

I was happy with the regular setup on my M1 (Sonoma), so I gave the Core ML setup a try, expecting it to be even better. However, I am very surprised to see that it completely degraded performance, at least for the models I am using (medium.en and large-v3). For instance, stream became unusable, both slow and inaccurate.

I'll revert to the regular setup, but I am very curious as to why using the ANE degraded performance so much; it is counterintuitive. I don't spot any errors, the Core ML models do seem to load, and I can see the ANE kick in using powermetrics. Disclaimer, in case it makes a difference: I used the prebuilt models on HF.

@astrowonk commented Feb 15, 2024 (edited)

> I was happy with the regular setup on my M1 (Sonoma), so I gave the Core ML setup a try, expecting it to be even better. However, I am very surprised to see that it completely degraded performance, at least for the models I am using (medium.en and large-v3). For instance, stream became unusable, both slow and inaccurate.
>
> I'll revert to the regular setup, but I am very curious as to why using the ANE degraded performance so much; it is counterintuitive. I don't spot any errors, the Core ML models do seem to load, and I can see the ANE kick in using powermetrics. Disclaimer, in case it makes a difference: I used the prebuilt models on HF.

Did you try the same model twice? There is still a considerable delay for me the first time the Core ML models run, but it is a little faster than the standard build for me after that. However, I see very little ANE usage when I compile for Core ML; it's almost all GPU for me.

@helins

Several times, yes. The first run easily took 15 minutes to prepare, as warned. But what I described applies to subsequent runs: with everything ready, it really underperforms, to the point of being unusable. Now I am back to the regular setup (Metal) and everything is fine once again; I can easily use the large-v3 model with stream for live transcription.

@astrowonk

> Several times, yes. The first run easily took 15 minutes to prepare, as warned. But what I described applies to subsequent runs: with everything ready, it really underperforms, to the point of being unusable. Now I am back to the regular setup (Metal) and everything is fine once again; I can easily use the large-v3 model with stream for live transcription.

I'm not 100% sure, but after this PR it might be worth trying to convert the models to Core ML yourself, depending on when/how the Hugging Face Core ML models were made.

@RazeBerry

> Several times, yes. The first run easily took 15 minutes to prepare, as warned. [...]
>
> I'm not 100% sure, but after this PR it might be worth trying to convert the models to Core ML yourself, depending on when/how the Hugging Face Core ML models were made.

Just converted it myself; it took about 10 minutes on an M2 Pro + 16 GB RAM.

@shell1986

My model does not start; it just says that it cannot find the file, even though the model is compiled and is in the folder.

@sahmed53 commented May 18, 2024 (edited)

I have posted this in the main issues section too (I apologise for the double post), but maybe people here can reply, since this is a specific Core ML thread. My problem is with using Core ML in iOS apps: I have noticed that the size of the app jumps dramatically every time Core ML is fired up. Downloading the app container in Xcode doesn't show why "Documents & Data" increases by many MB, and sometimes GB, with repeated usage. So I was wondering, if anyone here has used the Objective-C sample or something similar, can they check the app size after running (Settings -> General -> Storage)? Where could the app be saving Core ML files? What could be going on?

  • This only happens with Core ML, not Metal.
  • I have cleared caches and temp files, but it doesn't affect "Documents & Data".
  • The Xcode container does not match the size shown in the Settings indicator.
  • I have looked in Instruments but can't find the directory the files are being written to; it shows a tmp folder being written to with ANE weights.

Does this issue mean it can't be deployed in production-ready apps?

[Screenshot 2024-05-18 at 15:14:25]

Please someone help!

@day-dreaming-guy commented Jul 28, 2024 (edited)

Hey @sahmed53! Have you solved it?

@bjnortier (Contributor)

When you load a Core ML model for the first time, it does an optimisation pass and saves the optimised model somewhere. I could never figure out where; it is something internal and hidden. I suspect that's what you're seeing. Sometimes the OS will delete those files (I assume when storage is low), and then when you load the Core ML model again it will do the optimisation step again. This can take very long on some devices.

This is why I've stopped using CoreML for my app and I only use the Metal version.
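For reference, the two build variants being compared in this thread, as a sketch (the plain build is the Metal/CPU path on Apple Silicon; the Core ML build adds the ANE-capable encoder and pays the first-load specialization cost):

    make clean && make -j                     # regular build (Metal/CPU)
    make clean && WHISPER_COREML=1 make -j    # Core ML build (ANE encoder)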

@aehlke commented Jul 29, 2024 (edited)

@bjnortier does WhisperKit suffer from the same issue? It became quite popular and relies on Core ML rather than Metal.

@bjnortier (Contributor)

@aehlke Yes, if you use the WhisperKit macOS TestFlight app you will see "Specializing [...] for your device... This can take several minutes on first load".


iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024

* coreml : use Core ML encoder inference
* coreml : simlpify whisper_encode + log messages
* whisper : resolve rebase conflicts
* coreml : add scripts for CoreML model generation
* bench-all : recognize COREML flag
@androslaforc

> Hi, which version of Python should I use to install these dependencies? I tried 3.11 and 3.10, but failed to install all the dependencies.
>
>     pip install ane_transformers
>     pip install openai-whisper
>     pip install coremltools

Have you got any answer about this?

@androslaforc

> Hey, for people who are still struggling with trying this out, here is a bit of a script that shows what my environment is like. I've had a bit of trouble getting this running, so here is my attempt to make a minimally reproducible set of commands. I haven't checked speed improvements yet. This is on a MacBook Air M1, running Ventura 13.3.1.
>
>     # I also had problems with xcode, this helped me when it didn't work after reinstalling xcode, thanks @neurostar
>     sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
>
>     # setting up coreml conda environment, 3.9 works for me
>     conda create --name core_whisper_3_9 python=3.9 -y
>     conda activate core_whisper_3_9
>
>     # torchvision and tensorflow aren't necessary, but I've had a bit of a problem getting coreml itself to work, so this is to run their example program
>     pip install torchvision==0.15.1 tensorflow-macos==2.9 coremltools==6.3 ane-transformers==0.1.1 openai-whisper==20230314
>
>     # checking whether coreml itself works correctly, no whisper.cpp involvement, code copied from [coremltools.readme.io](https://coremltools.readme.io/docs/convert-a-torchvision-model-from-pytorch)
>     python -c '
>     import coremltools as ct
>     import torch
>     import torchvision
>
>     # Load PyTorch model (and perform tracing)
>     torch_model = torchvision.models.mobilenet_v2()
>     torch_model.eval()
>     example_input = torch.rand(1, 3, 256, 256)
>     traced_model = torch.jit.trace(torch_model, example_input)
>
>     # Convert using the same API. Note that we need to provide "inputs" for pytorch conversion.
>     model_from_torch = ct.convert(traced_model,
>                                   inputs=[ct.TensorType(name="input",
>                                                         shape=example_input.shape)],
>                                   debug=True)
>     '
>
>     # ensure whisper.cpp repo is in the same state as mine was
>     cd <mypath>/whisper.cpp
>     git pull
>     git checkout v1.3.0
>     # interactive: remove downloaded models. you can skip this step, but it ensures you have a fresh install of the models
>     git clean -idx
>
>     # building whisper.cpp with coreml support
>     mkdir build
>     cd build
>     cmake -DWHISPER_COREML=1 ..
>     make -j
>     cd ..
>
>     # download model and convert to coreml
>     bash ./models/download-ggml-model.sh tiny
>     bash ./models/generate-coreml-model.sh tiny
>
>     # try out jfk sample
>     ./build/bin/main -m models/ggml-tiny.bin -f samples/jfk.wav
>
> This led to the following output for the sample for me:
>
>     whisper_init_from_file_no_state: loading model from 'models/ggml-tiny.bin'
>     whisper_model_load: loading model
>     whisper_model_load: n_vocab       = 51865
>     whisper_model_load: n_audio_ctx   = 1500
>     whisper_model_load: n_audio_state = 384
>     whisper_model_load: n_audio_head  = 6
>     whisper_model_load: n_audio_layer = 4
>     whisper_model_load: n_text_ctx    = 448
>     whisper_model_load: n_text_state  = 384
>     whisper_model_load: n_text_head   = 6
>     whisper_model_load: n_text_layer  = 4
>     whisper_model_load: n_mels        = 80
>     whisper_model_load: f16           = 1
>     whisper_model_load: type          = 1
>     whisper_model_load: mem required  =  129.00 MB (+    3.00 MB per decoder)
>     whisper_model_load: adding 1608 extra tokens
>     whisper_model_load: model ctx     =   73.58 MB
>     whisper_model_load: model size    =   73.54 MB
>     whisper_init_state: kv self size  =    2.62 MB
>     whisper_init_state: kv cross size =    8.79 MB
>     whisper_init_state: loading Core ML model from 'models/ggml-tiny-encoder.mlmodelc'
>     whisper_init_state: first run on a device may take a while ...
>     whisper_init_state: Core ML model loaded
>     system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |
>     main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
>     [00:00:00.000 --> 00:00:10.500]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.
>     whisper_print_timings:     load time =    97.74 ms
>     whisper_print_timings:     fallbacks =   0 p /   0 h
>     whisper_print_timings:      mel time =    41.69 ms
>     whisper_print_timings:   sample time =    10.67 ms /    25 runs (    0.43 ms per run)
>     whisper_print_timings:   encode time =    33.32 ms /     1 runs (   33.32 ms per run)
>     whisper_print_timings:   decode time =    38.86 ms /    25 runs (    1.55 ms per run)
>     whisper_print_timings:    total time =  4314.09 ms
>
> Hope this helps someone.

Thanks a lot bro, it works! I was stuck. I love you.
