Core ML support #566

Merged
ggerganov merged 5 commits into master from coreml
Apr 15, 2023
Conversation

@ggerganov (Member) commented Mar 5, 2023 (edited)

Running Whisper inference on Apple Neural Engine (ANE) via Core ML

This PR extends whisper.cpp to run the Whisper Encoder on the ANE through Core ML inference.
The performance gain is more than 3x compared to 8-thread CPU for the tiny, base and small models.

Here are initial performance benchmarks for the Encoder on M1 Pro with (top) and without (bottom) Core ML:

| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit |
|---|---|---|---|---|---|---|---|
| MacBook M1 Pro | MacOS 13.2.1 | CORE ML | tiny | 4 | 50 | 30 | b0ac915 |
| MacBook M1 Pro | MacOS 13.2.1 | CORE ML | base | 4 | 74 | 64 | b0ac915 |
| MacBook M1 Pro | MacOS 13.2.1 | CORE ML | small | 4 | 188 | 208 | b0ac915 |
| MacBook M1 Pro | MacOS 13.2.1 | CORE ML | medium | 4 | 533 | 1033 | b0ac915 |
| MacBook M1 Pro | MacOS 13.2.1 | CORE ML | large | 4 | ? | ? | b0ac915 |

| CPU | OS | Config | Model | Th | Load [ms] | Encode [ms] | Commit |
|---|---|---|---|---|---|---|---|
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 8 | 71 | 102 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 8 | 96 | 220 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 8 | 233 | 685 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 8 | 603 | 1928 | 206fc93 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 8 | 1158 | 3350 | 206fc93 |

This PR adds a helper script models/generate-coreml-model.sh that can be used to easily generate a Core ML Encoder model yourself. For now, I don't plan on hosting the Core ML models as there is some chance that the implementation will change in the future. Therefore, it is recommended that everyone simply generate them locally with that script. See the instructions below.
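For the curious, here is a rough sketch of what the helper script wraps. This is an assumption pieced together from the converter script and the xcrun/coremlc references later in this thread; the actual file names and arguments used by models/generate-coreml-model.sh may differ:

    # Sketch only: convert the PyTorch encoder to a Core ML package, then compile it into the
    # .mlmodelc folder that whisper.cpp loads at runtime. The package path is illustrative.
    python3 models/convert-whisper-to-coreml.py --model base.en
    xcrun coremlc compile <generated-encoder-package>.mlpackage models/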

There are a couple of drawbacks:

  • The first time a Core ML model runs on a device takes a long time (several seconds, depending on the model).
    All follow-up runs are fast (see the warm-up sketch below).
  • The medium and large models take a long time to convert to Core ML (tens of minutes) and require a lot of RAM. The first run on a device is also very slow for them, so I'm not sure they are viable for production use.
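A minimal warm-up sketch (assuming the base.en model and the bundled jfk.wav sample): run the model once ahead of time so that the one-time on-device Core ML compilation does not hit the first real transcription.

    ./main -m models/ggml-base.en.bin -f samples/jfk.wav > /dev/null   # first run: slow (on-device Core ML compilation)
    ./main -m models/ggml-base.en.bin -f samples/jfk.wav               # subsequent runs: fast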

Acknowledgements

Huge thanks to @wangchou for the initial demonstration of how to use Core ML in whisper.cpp (#548)

Thanks to @RobertRiachi for optimizing for ANE execution and improving the model export process

Thanks to everyone else who participated in #548 and helped with insights, testing and ideas

Usage

  • Install dependencies:

    pip install ane_transformers
    pip install openai-whisper
    pip install coremltools
  • Generate a Core ML model. For example, to generate a base.en model, use:

    ./models/generate-coreml-model.sh base.en

    This will generate the folder models/ggml-base.en-encoder.mlmodelc

  • Build whisper.cpp with Core ML support:

    # using Makefile
    make clean
    WHISPER_COREML=1 make -j

    # using CMake
    cd build
    cmake -DWHISPER_COREML=1 ..
  • Run the examples as usual. For example:

    ./main -m models/ggml-base.en.bin -f samples/gb0.wav
    ...
    whisper_init_state: loading Core ML model from 'models/ggml-base.en-encoder.mlmodelc'
    whisper_init_state: first run on a device may take a while ...
    whisper_init_state: Core ML model loaded
    system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |
    ...

    The first run on a device is slow, since the ANE service compiles the Core ML model to a device-specific format.
    Subsequent runs are faster (see the consolidated sketch below).
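Putting the steps above together, a minimal end-to-end sketch (assuming the pip dependencies and the Xcode command line tools are already installed; the final grep only confirms that the Core ML path shows up in the init / system_info logs):

    ./models/download-ggml-model.sh base.en     # regular ggml model (encoder + decoder)
    ./models/generate-coreml-model.sh base.en   # Core ML encoder -> models/ggml-base.en-encoder.mlmodelc
    make clean
    WHISPER_COREML=1 make -j
    ./main -m models/ggml-base.en.bin -f samples/jfk.wav 2>&1 | grep -E "COREML|Core ML"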

TODO

  • Can the Decoder be ported to ANE too? Run encoder on Apple Neural Engine #548 (reply in thread)
    Answer: Yes, but it is slow
  • Convert the medium and large models to Core ML format and upload to HF
    Needs an Apple Silicon Mac with 64GB RAM to do the conversion from PyTorch -> Core ML
    Does not seem viable - too slow
  • Unified ggml + coreml model file
    We currently load both the full ggml model (encoder + decoder) and the coreml encoder - not optimal
    Will be done in the future, hopefully via community contributions
  • Scripts for generating Core ML model files (e.g. https://github.com/wangchou/callCoreMLFromCpp)
  • Support loading the Core ML model from a memory buffer
    Currently we support only loading from a folder on disk
    Low-prio, hoping for contributions
  • Progress report for initial-run model processing
    Does not look possible. Any Core ML experts?
  • Adjust memory usage buffers when using Core ML
    Not needed - the Encoder compute buffer is less than 20MB even for the large model
  • Try to avoid the first on-device automatic model generation (it takes a long time)
    Does not look possible. Any Core ML experts?
  • The medium model takes more than 30 minutes to convert on the first run. Is there a work-around?
    I think not
  • Can we run the Core ML inference on the GPU?
    Looks like it's not worth it

Future work

  • Fix the ANE-optimized Whisper implementation. Currently, there is something wrong with the tensor shapes when passed from / to whisper.cpp and the transcription gets corrupted. The optimized version should be about 1.5x faster than the original one.
  • Add support for decoder-only ggml models. This will avoid having to store the Encoder data twice on disk / in memory. Currently, it is stored once in the ggml model and again in the Core ML model. Removing the duplication will reduce both disk and memory usage.
  • Add support for running the Decoder on the ANE. Due to the nature of the Decoder operations, running them on the CPU is generally faster than running them on the ANE. However, an ANE Decoder should be much more energy-efficient than the CPU one, so having this option could be useful in some cases.

@brozkrut

Great work!

I tested the coreml branch on a Mac Mini M2 (base $599 model).

The performance gain seems to be more than 5x compared to 4-thread CPU (thanks to the much faster ANE on the M2; on the base Mac Mini M2, the 8-thread CPU run is slower than the 4-thread one).

Performance benchmarks for the Encoder with (top) and without (bottom) Core ML:

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
|---|---|---|---|---|---|---|---|
| Mac Mini M2 | macOS 13.2.1 | CORE ML | tiny | 4 | 44 | 25 | 17a1459 |
| Mac Mini M2 | macOS 13.2.1 | CORE ML | base | 4 | 66 | 54 | 17a1459 |
| Mac Mini M2 | macOS 13.2.1 | CORE ML | small | 4 | 163 | 190 | 17a1459 |
| Mac Mini M2 | macOS 13.2.1 | CORE ML | medium | 4 | | | 17a1459 |
| Mac Mini M2 | macOS 13.2.1 | CORE ML | large | 4 | | | 17a1459 |

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
|---|---|---|---|---|---|---|---|
| Mac Mini M2 | macOS 13.2.1 | NEON BLAS | tiny | 4 | 40 | 142 | 59fdcd1 |
| Mac Mini M2 | macOS 13.2.1 | NEON BLAS | base | 4 | 67 | 299 | 59fdcd1 |
| Mac Mini M2 | macOS 13.2.1 | NEON BLAS | small | 4 | 152 | 980 | 59fdcd1 |
| Mac Mini M2 | macOS 13.2.1 | NEON BLAS | medium | 4 | | | 59fdcd1 |
| Mac Mini M2 | macOS 13.2.1 | NEON BLAS | large | 4 | | | 59fdcd1 |

@DontEatOreo

I compiled whisper.cpp with Core ML support using make, and I also built the mlmodel, but I'm getting an error:

whisper_init_from_file: loading model from 'models/ggml-base.en.mlmodelc'
whisper_model_load: loading model
whisper_model_load: invalid model data (bad magic)
whisper_init: failed to load model
error: failed to initialize whisper context

Is there anything else I'm missing? 🤔

@ggerganov (Member, Author)

@DontEatOreo

On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin.
The code will automatically also load models/ggml-base.en.mlmodelc if it is present in the same folder.
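In other words, a sketch of the expected layout (the Core ML folder name here follows the generate script from the PR description; the exact suffix has changed between revisions of this branch):

    # models/ggml-base.en.bin                <- pass this path to ./main
    # models/ggml-base.en-encoder.mlmodelc   <- picked up automatically if present next to the .bin
    ./main -m models/ggml-base.en.bin -f samples/jfk.wav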

@DontEatOreo

@ggerganov Thank you! I was very confused about why it wasn't working even though I did everything right.

@dennislysenko

This is great. Excited to see how this feature develops. Leveraging the ANE would be huge, even more so if the decoder could be ported to it as well.

@strangelearning

Just saw this was announced. Is it useful? https://github.com/apple/ml-ane-transformers

@cerupcat

> @DontEatOreo
>
> On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin. The code will automatically also load models/ggml-base.en.mlmodelc if it is present in the same folder.

Does this mean we have to bundle both files with the app? Asking since the file size gets fairly large having to include them all.

@lucabeetz

Hey, thanks for this awesome project! I am trying to run the whisper.objc example with CoreML but running into some issues. Has someone successfully done this and could guide me on how to set it up?

@ggerganov (Member, Author)

> @DontEatOreo
> On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin. The code will automatically also load models/ggml-base.en.mlmodelc if it is present in the same folder.
>
> Does this mean we have to bundle both files with the app? Asking since the file size gets fairly large having to include them all.

The solution is to produce an encoder-only Core ML model in one file and a decoder-only standard model in another file. This is not very difficult to achieve, but supporting so many model files might become too much for me to maintain. So I will probably rely on someone helping out and demonstrating how this can be done, either as an example in this repo or in a fork.

@ggerganov marked this pull request as ready for review April 14, 2023 19:27

@ggerganov (Member, Author)

This is almost ready to merge. I am hoping to do it tomorrow.

The most important part that currently needs testing is the creation of the CoreML models, following the instructions here:

#548 (reply in thread)

If you give this a try, please let us know the results and whether you encountered any issues.
Also, let us know whether you used quantized or non-quantized Core ML models and what the experience has been.

I believe the tiny, base and small models should be supported, while medium and large do not seem viable for this approach.

@aehlke

1.4 GB for medium sounds fine for users, but you're saying there are other limitations with it?

@ggerganov (Member, Author)

@aehlke The scripts for generating Core ML models support all sizes, but on my M1 Pro it takes a very long time (i.e. more than half an hour) to generate the medium model. After that, the first run is also very slow. Subsequent runs are about 2 times faster compared to CPU-only.

In any case, you can follow the instructions in this PR and see how it works on your device.


@ggerganov merged commit 5e47e22 into master Apr 15, 2023
@ggerganov deleted the coreml branch April 15, 2023 10:21
@neurostar commented Apr 15, 2023 (edited)

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
|---|---|---|---|---|---|---|---|
| MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | tiny | 4 | 41 | 31 | f19e23f |
| MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | base | 4 | 59 | 57 | f19e23f |
| MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | small | 4 | 147 | 195 | f19e23f |
| MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | medium | 4 | 576 | 783 | f19e23f |
| MacBook Air M2 | MacOS 13.3.1 | NEON BLAS COREML | large | 4 | 1196 | 2551 | f19e23f |

Great work!
Converting the large model to a Core ML model consumed ~9.7 GB of memory (short peak of 15.03 GB); it still worked fine on the 8 GB Air.

Edit:
I measured the Core ML model conversion time and the first-load conversion time (second - first).

| Model | Core ML conv | First-load conv (sec) |
|---|---|---|
| tiny | 4.915 | 0.72 |
| base | 8.564 | 1.34 |
| small | 26.050 | 4.72 |
| medium | 1:35.85 | 15.57 |
| large | 3:43.32 | 35.10 |
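For anyone who wants to reproduce the conversion column, a simple sketch (assuming the tiny ggml model is already downloaded; the exact time output format depends on your shell, and the first ./main run afterwards then pays the additional first-load cost):

    time ./models/generate-coreml-model.sh tiny              # Core ML conversion time
    time ./main -m models/ggml-tiny.bin -f samples/jfk.wav   # includes the one-time first-load compilation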
@CarberryChai

When running this script:

./models/generate-coreml-model.sh base.en

I got the error:

xcrun: error: unable to find utility "coremlc", not a developer tool or in PATH
@flexchar

Is it just me, or is the link to the Core ML models missing from Hugging Face?

Btw, @ggerganov, if you need help converting the models, I'd be glad to contribute. It seems to me that it only needs to be done once. :)

@ggerganov (Member, Author)

For now, you should generate the Core ML models locally following the instructions.
I don't want to host them on HF yet, because it is very likely that the models will change soon - there are some pending improvements (see #548 (reply in thread)). If I upload them now, later we will get new models and everyone will be confused about which model they are using, etc.

@flexchar

In that regard, I'd like to ask for help, since I can't seem to get it to work.

python3.10 ./models/convert-whisper-to-coreml.py --model tiny

100%|██████████| 72.1M/72.1M [00:05<00:00, 14.3MiB/s]
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=384, n_audio_head=6, n_audio_layer=4, n_vocab=51865, n_text_ctx=448, n_text_state=384, n_text_head=6, n_text_layer=4)
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100%|█████████▋| 367/368 [00:00<00:00, 6681.50 ops/s]
Running MIL frontend_pytorch pipeline: 100%|██████████| 5/5 [00:00<00:00, 1047.63 passes/s]
Running MIL default pipeline: 100%|██████████| 57/57 [00:00<00:00, 147.77 passes/s]
Running MIL backend_mlprogram pipeline: 100%|██████████| 10/10 [00:00<00:00, 2599.51 passes/s]
Traceback (most recent call last):
  File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 331, in <module>
    decoder = convert_decoder(hparams, decoder, quantize=args.quantize)
  File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 283, in convert_decoder
    traced_model = torch.jit.trace(model, (token_data, audio_data))
  File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 741, in trace
    return trace_module(
  File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 958, in trace_module
    module._c._create_method_from_trace(
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 211, in forward
    x = block(x, xa, mask=self.mask, kv_cache=kv_cache)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 138, in forward
    x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0]
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 83, in forward
    k = self.key(x if xa is None else xa)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 37, in forward
    return F.linear(
RuntimeError: mat1 and mat2 shapes cannot be multiplied (384x1500 and 384x384)

@ganqqwerty

This gets stuck forever on an M1 with 64 GB. I waited for 12 hours but still got no more messages. macOS 13.5 (22G74).

whisper_init_state: loading Core ML model from 'models/ggml-large-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...

@artemgordinskiy commented Sep 2, 2023 (edited)

I finally managed to get it to work on the "beta" (v1.4.2), with the same HW and OS as @ganqqwerty:

  1. Built with WHISPER_COREML=1 make -j.
  2. Downloaded the large CoreML model from Huggingface.
  3. Ran a sample overnight (~11 hours):

~/D/whisper.cpp ❯❯❯ ./main -m models/ggml-large.bin -f samples/jfk.wav    [20:34:38]
whisper_init_from_file_no_state: loading model from 'models/ggml-large.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5
whisper_model_load: mem required  = 3557.00 MB (+   71.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     = 2951.27 MB
whisper_model_load: model size    = 2950.66 MB
whisper_init_state: kv self size  =   70.00 MB
whisper_init_state: kv cross size =  234.38 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings:     load time =   985.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    35.84 ms
whisper_print_timings:   sample time =    11.56 ms /    27 runs (    0.43 ms per run)
whisper_print_timings:   encode time =  3036.61 ms /     1 runs ( 3036.61 ms per run)
whisper_print_timings:   decode time =   794.28 ms /    27 runs (   29.42 ms per run)
whisper_print_timings:    total time = 40924196.00 ms

  4. And the consecutive runs go much faster now, with the model loading in just a few seconds:

whisper_print_timings:     load time =  1141.81 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    35.75 ms
whisper_print_timings:   sample time =    11.45 ms /    27 runs (    0.42 ms per run)
whisper_print_timings:   encode time =  3596.32 ms /     1 runs ( 3596.32 ms per run)
whisper_print_timings:   decode time =   825.67 ms /    27 runs (   30.58 ms per run)
whisper_print_timings:    total time =  6655.50 ms

Does anyone know what happened during those 11 hours and why it runs faster now? If the model got "compiled" or whatever, can't I just upload it for other people to use? I don't see any changes to the model files since I downloaded them 🤔

@cust0mphase

> I finally managed to get it to work on the "beta" (v1.4.2), with the same HW and OS as @ganqqwerty: [...]
>
> Does anyone know what happened during those 11 hours and why it runs faster now? If the model got "compiled" or whatever, can't I just upload it for other people to use? I don't see any changes to the model files since I downloaded them 🤔

Can you upload it, please?

@artemgordinskiy

@cust0mphase Upload what? The CoreML model link is in my comment above, and as far as I can see, the files have not changed since I downloaded them.

@ganqqwerty commented Sep 3, 2023 (edited)

I confirm that it works well with the model from Hugging Face (of course, I use large). The performance boost in Ventura 13.5 (22G74) is not that big, maybe 20%, but it's definitely faster. Can't wait for the new OS to come out.

@dhwkdjwndjwjjn

ANE

Hi, I have a question. I was able to run the Core ML models perfectly on my MacBook Pro M1 Pro. However, when I look at the CPU/GPU/ANE usage through powermetrics while transcribing through Core ML models, I noticed the ANE usage is 0% throughout the transcription and GPU use is 100%. So how do we actually make Core ML run on ANE?

Also, I can confirm that macOS Sonoma 14.0 Beta did a much faster job of converting to a Core ML model; I was able to convert the large model in under an hour, while on macOS 13 my conversion of the large model would get stuck overnight and never finish.

Last question: can we, and if so how can we, run the real-time transcription ./stream with the Core ML model? I was only able to run ./stream with the normal model.

Thanks, and great work by the author(s) of whisper.cpp!

@dhwkdjwndjwjjn

> ANE
>
> Hi, I have a question. I was able to run the Core ML models perfectly on my MacBook Pro M1 Pro. However, when I look at the CPU/GPU/ANE usage through powermetrics while transcribing through Core ML models, I noticed the ANE usage is 0% throughout the transcription and GPU use is 100%. So how do we actually make Core ML run on ANE? [...]

OK, I just found out how to do it from others' discussion...

[Screenshot 2023-09-18 at 12:41:36 PM]

You can set it in the file coreml/whisper-encoder.mm.

And as for running Core ML with ./stream, you just need to run:

make clean
WHISPER_COREML=1 make stream -j

and then you can just run ./stream normally and the Core ML model will be loaded.
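For completeness, a sketch of the resulting invocation (the -m / -t flags are the usual stream options; the Core ML encoder is picked up automatically if it sits next to the ggml model, exactly as with ./main):

    ./stream -m models/ggml-base.en.bin -t 4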

@dreampuf

FYI: comparing CPU+GPU vs. CPU+ANE:

# CPU + GPU
whisper_print_timings:     load time =   185.77 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   729.95 ms
whisper_print_timings:   sample time =  3544.57 ms /  8631 runs (    0.41 ms per run)
whisper_print_timings:   encode time =  8853.00 ms /    49 runs (  180.67 ms per run)
whisper_print_timings:   decode time = 50679.41 ms /  8576 runs (    5.91 ms per run)
whisper_print_timings:   prompt time =  1938.64 ms /    52 runs (   37.28 ms per run)
whisper_print_timings:    total time = 66302.43 ms

## second time
whisper_print_timings:     load time =   306.99 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   666.95 ms
whisper_print_timings:   sample time =  3934.44 ms /  8631 runs (    0.46 ms per run)
whisper_print_timings:   encode time =  7717.25 ms /    49 runs (  157.49 ms per run)
whisper_print_timings:   decode time = 51892.14 ms /  8576 runs (    6.05 ms per run)
whisper_print_timings:   prompt time =  1951.12 ms /    52 runs (   37.52 ms per run)
whisper_print_timings:    total time = 67378.17 ms

# CPU + ANE
whisper_print_timings:     load time =   426.37 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   655.52 ms
whisper_print_timings:   sample time =  4105.80 ms /  9129 runs (    0.45 ms per run)
whisper_print_timings:   encode time = 10249.34 ms /    48 runs (  213.53 ms per run)
whisper_print_timings:   decode time = 55378.71 ms /  9073 runs (    6.10 ms per run)
whisper_print_timings:   prompt time =  1981.35 ms /    52 runs (   38.10 ms per run)
whisper_print_timings:    total time = 73484.55 ms

# CPU + ALL
whisper_print_timings:     load time =   328.41 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   699.48 ms
whisper_print_timings:   sample time =  4050.11 ms /  9129 runs (    0.44 ms per run)
whisper_print_timings:   encode time = 10222.64 ms /    48 runs (  212.97 ms per run)
whisper_print_timings:   decode time = 54836.89 ms /  9073 runs (    6.04 ms per run)
whisper_print_timings:   prompt time =  1984.60 ms /    52 runs (   38.17 ms per run)
whisper_print_timings:    total time = 72802.16 ms
@astrowonk commented Oct 1, 2023 (edited)

I don't have precise before/after numbers, but CoreML Whisper sure seems a lot faster on Sonoma. Not just the "first run on a device may take a while …" step which is almost instant now, but the actual encoding seems better?

Maybe this is something improved in the latest versions of whisper.cpp itself, but it runs at close to 100% GPU usage now, which I don't remember always being the case. It's ~5x faster than realtime with the medium.en model on my lowly regular M1.

@dreampuf

Here is an update after Sonoma.

# CPU + GPU
whisper_print_timings:     load time =   298.31 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   687.01 ms
whisper_print_timings:   sample time =  3626.06 ms /  8863 runs (    0.41 ms per run)
whisper_print_timings:   encode time =  9034.63 ms /    48 runs (  188.22 ms per run)
whisper_print_timings:   decode time = 52123.91 ms /  8810 runs (    5.92 ms per run)
whisper_print_timings:   prompt time =  1883.27 ms /    51 runs (   36.93 ms per run)
whisper_print_timings:    total time = 69305.77 ms
ggml_metal_free: deallocating

# 2nd round
whisper_print_timings:     load time =   220.71 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   659.20 ms
whisper_print_timings:   sample time =  3607.61 ms /  8863 runs (    0.41 ms per run)
whisper_print_timings:   encode time =  7268.91 ms /    48 runs (  151.44 ms per run)
whisper_print_timings:   decode time = 52101.25 ms /  8810 runs (    5.91 ms per run)
whisper_print_timings:   prompt time =  1880.41 ms /    51 runs (   36.87 ms per run)
whisper_print_timings:    total time = 66078.09 ms

# CPU + ANE
whisper_print_timings:     load time =   290.60 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   674.62 ms
whisper_print_timings:   sample time =  3722.67 ms /  9019 runs (    0.41 ms per run)
whisper_print_timings:   encode time = 10463.12 ms /    48 runs (  217.98 ms per run)
whisper_print_timings:   decode time = 52677.20 ms /  8963 runs (    5.88 ms per run)
whisper_print_timings:   prompt time =  1935.95 ms /    52 runs (   37.23 ms per run)
whisper_print_timings:    total time = 105001.48 ms

# 2nd round
whisper_print_timings:     load time =   218.93 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   647.12 ms
whisper_print_timings:   sample time =  3874.24 ms /  9019 runs (    0.43 ms per run)
whisper_print_timings:   encode time = 10568.01 ms /    48 runs (  220.17 ms per run)
whisper_print_timings:   decode time = 53258.39 ms /  8963 runs (    5.94 ms per run)
whisper_print_timings:   prompt time =  1956.66 ms /    52 runs (   37.63 ms per run)
whisper_print_timings:    total time = 70788.73 ms

# CPU + ANE + GPU
whisper_print_timings:     load time =   203.14 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   679.72 ms
whisper_print_timings:   sample time =  3868.27 ms /  9019 runs (    0.43 ms per run)
whisper_print_timings:   encode time = 10651.40 ms /    48 runs (  221.90 ms per run)
whisper_print_timings:   decode time = 53248.52 ms /  8963 runs (    5.94 ms per run)
whisper_print_timings:   prompt time =  1942.67 ms /    52 runs (   37.36 ms per run)
whisper_print_timings:    total time = 105808.82 ms

# 2nd round
whisper_print_timings:     load time =   223.98 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =   650.97 ms
whisper_print_timings:   sample time =  3727.37 ms /  9019 runs (    0.41 ms per run)
whisper_print_timings:   encode time = 10526.05 ms /    48 runs (  219.29 ms per run)
whisper_print_timings:   decode time = 53171.40 ms /  8963 runs (    5.93 ms per run)
whisper_print_timings:   prompt time =  1950.87 ms /    52 runs (   37.52 ms per run)
whisper_print_timings:    total time = 70573.20 ms

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023

* coreml : use Core ML encoder inference
* coreml : simlpify whisper_encode + log messages
* whisper : resolve rebase conflicts
* coreml : add scripts for CoreML model generation
* bench-all : recognize COREML flag

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023

* coreml : use Core ML encoder inference
* coreml : simlpify whisper_encode + log messages
* whisper : resolve rebase conflicts
* coreml : add scripts for CoreML model generation
* bench-all : recognize COREML flag

landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023

* coreml : use Core ML encoder inference
* coreml : simlpify whisper_encode + log messages
* whisper : resolve rebase conflicts
* coreml : add scripts for CoreML model generation
* bench-all : recognize COREML flag

@helins

I was happy with the regular setup on my M1 (Sonoma), so I gave the Core ML setup a try, expecting it to be even better. However, I am very surprised to see that it completely degraded performance, at least for the models I am using (medium.en and large-v3). For instance, stream became unusable, both slow and inaccurate.

I'll revert to the regular setup, but I am very curious as to why using the ANE degraded performance so much; it is counterintuitive. I don't spot any errors, the Core ML models do seem to load, and I can see the ANE kick in using powermetrics. Disclaimer, in case it makes a difference: I used the prebuilt models on HF.

@astrowonk commented Feb 15, 2024 (edited)

> I was happy with the regular setup on my M1 (Sonoma), so I gave the Core ML setup a try, expecting it to be even better. However, I am very surprised to see that it completely degraded performance, at least for the models I am using (medium.en and large-v3). For instance, stream became unusable, both slow and inaccurate.
>
> I'll revert to the regular setup, but I am very curious as to why using the ANE degraded performance so much; it is counterintuitive. I don't spot any errors, the Core ML models do seem to load, and I can see the ANE kick in using powermetrics. Disclaimer, in case it makes a difference: I used the prebuilt models on HF.

Did you try the same model twice? There is still a considerable delay for me the first time the Core ML models run, but it is a little faster than the standard build for me after that. However, I see very little ANE usage when I compile for Core ML; it's almost all GPU for me.

@helins

Several times, yes. The first run easily took 15 minutes to prepare, as warned. But what I described applies to subsequent runs: with everything ready, it really underperforms, to the point of being unusable. Now I am back to the regular setup (Metal) and everything is fine once again; I can easily use the large-v3 model with stream for live transcription.

@astrowonk

> Several times, yes. The first run easily took 15 minutes to prepare, as warned. But what I described applies to subsequent runs: with everything ready, it really underperforms, to the point of being unusable. Now I am back to the regular setup (Metal) and everything is fine once again; I can easily use the large-v3 model with stream for live transcription.

I'm not 100% sure, but after this PR it might be worth trying to convert the models to Core ML yourself, depending on when/how the Hugging Face Core ML models were made.

@RazeBerry

> Several times, yes. The first run easily took 15 minutes to prepare, as warned. [...]
>
> I'm not 100% sure, but after this PR it might be worth trying to convert the models to Core ML yourself, depending on when/how the Hugging Face Core ML models were made.

Just converted it myself; it took about 10 minutes on an M2 Pro + 16 GB RAM.

@shell1986

My model does not start; it just says that it cannot find the file, even though the model is compiled and is in the folder.

@sahmed53 commented May 18, 2024 (edited)

I have posted this in the main issues section too (I apologise for the double post), but maybe people here can reply, since this is a specific Core ML thread. My problem is with using Core ML in iOS apps: I have noticed that the size of the app jumps dramatically every time Core ML is fired up. Downloading the app container in Xcode doesn't show why "Documents & Data" increases by many MB, and sometimes GB, with repeated usage. So I was wondering, if anyone here has used the Objective-C sample or something similar, can they check the app size after running (Settings -> General -> Storage)? Where could the app be saving Core ML files? What could be going on?

  • This only happens with Core ML, not Metal.
  • I have cleared caches and temp files, but it doesn't affect "Documents & Data".
  • The Xcode container does not match the size shown in the Settings indicator.
  • I have looked in Instruments but can't find the directory the files are being written to; it shows a tmp folder being written to with ANE weights.

Does this issue mean it can't be deployed in production-ready apps?

[Screenshot 2024-05-18 at 15:14:25]

Please someone help!

@day-dreaming-guy commented Jul 28, 2024 (edited)

Hey @sahmed53! Have you solved it?

@bjnortier (Contributor)

When you load a Core ML model for the first time, it does an optimisation pass and saves the optimised model somewhere. I could never figure out where; it is something internal and hidden. I suspect that's what you're seeing. Sometimes the OS will delete those files (I assume when storage is low), and then when you load the Core ML model again it will do the optimisation step again. This can take very long on some devices.

This is why I've stopped using CoreML for my app and I only use the Metal version.
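For reference, the two build variants being compared in this thread, as a sketch (the plain build is the Metal/CPU path on Apple Silicon; the Core ML build adds the ANE-capable encoder and pays the first-load specialization cost):

    make clean && make -j                     # regular build (Metal/CPU)
    make clean && WHISPER_COREML=1 make -j    # Core ML build (ANE encoder)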

@aehlke commented Jul 29, 2024 (edited)

@bjnortier does WhisperKit suffer from the same issue? It became quite popular and relies on Core ML rather than Metal.

@bjnortier (Contributor)

@aehlke Yes, if you use the WhisperKit macOS TestFlight app you will see "Specializing [...] for your device... This can take several minutes on first load".


iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024

* coreml : use Core ML encoder inference
* coreml : simlpify whisper_encode + log messages
* whisper : resolve rebase conflicts
* coreml : add scripts for CoreML model generation
* bench-all : recognize COREML flag
@androslaforc

> Hi, which version of Python should I use to install these dependencies? I tried 3.11 and 3.10, but failed to install all the dependencies.
>
>     pip install ane_transformers
>     pip install openai-whisper
>     pip install coremltools

Have you got any answer about this?

@androslaforc

> Hey, for people who are still struggling with trying this out, here is a bit of a script that shows what my environment is like. I've had a bit of trouble getting this running, so here is my attempt to make a minimally reproducible set of commands. I haven't checked speed improvements yet. This is on a MacBook Air M1, running Ventura 13.3.1.
>
>     # I also had problems with xcode, this helped me when it didn't work after reinstalling xcode, thanks @neurostar
>     sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
>
>     # setting up coreml conda environment, 3.9 works for me
>     conda create --name core_whisper_3_9 python=3.9 -y
>     conda activate core_whisper_3_9
>
>     # torchvision and tensorflow aren't necessary, but I've had a bit of a problem getting coreml itself to work, so this is to run their example program
>     pip install torchvision==0.15.1 tensorflow-macos==2.9 coremltools==6.3 ane-transformers==0.1.1 openai-whisper==20230314
>
>     # checking whether coreml itself works correctly, no whisper.cpp involvement, code copied from [coremltools.readme.io](https://coremltools.readme.io/docs/convert-a-torchvision-model-from-pytorch)
>     python -c '
>     import coremltools as ct
>     import torch
>     import torchvision
>
>     # Load PyTorch model (and perform tracing)
>     torch_model = torchvision.models.mobilenet_v2()
>     torch_model.eval()
>     example_input = torch.rand(1, 3, 256, 256)
>     traced_model = torch.jit.trace(torch_model, example_input)
>
>     # Convert using the same API. Note that we need to provide "inputs" for pytorch conversion.
>     model_from_torch = ct.convert(traced_model,
>                                   inputs=[ct.TensorType(name="input",
>                                                         shape=example_input.shape)],
>                                   debug=True)
>     '
>
>     # ensure whisper.cpp repo is in the same state as mine was
>     cd <mypath>/whisper.cpp
>     git pull
>     git checkout v1.3.0
>     # interactive: remove downloaded models. you can skip this step, but it ensures you have a fresh install of the models
>     git clean -idx
>
>     # building whisper.cpp with coreml support
>     mkdir build
>     cd build
>     cmake -DWHISPER_COREML=1 ..
>     make -j
>     cd ..
>
>     # download model and convert to coreml
>     bash ./models/download-ggml-model.sh tiny
>     bash ./models/generate-coreml-model.sh tiny
>
>     # try out jfk sample
>     ./build/bin/main -m models/ggml-tiny.bin -f samples/jfk.wav
>
> This led to the following output for the sample for me:
>
>     whisper_init_from_file_no_state: loading model from 'models/ggml-tiny.bin'
>     whisper_model_load: loading model
>     whisper_model_load: n_vocab       = 51865
>     whisper_model_load: n_audio_ctx   = 1500
>     whisper_model_load: n_audio_state = 384
>     whisper_model_load: n_audio_head  = 6
>     whisper_model_load: n_audio_layer = 4
>     whisper_model_load: n_text_ctx    = 448
>     whisper_model_load: n_text_state  = 384
>     whisper_model_load: n_text_head   = 6
>     whisper_model_load: n_text_layer  = 4
>     whisper_model_load: n_mels        = 80
>     whisper_model_load: f16           = 1
>     whisper_model_load: type          = 1
>     whisper_model_load: mem required  =  129.00 MB (+    3.00 MB per decoder)
>     whisper_model_load: adding 1608 extra tokens
>     whisper_model_load: model ctx     =   73.58 MB
>     whisper_model_load: model size    =   73.54 MB
>     whisper_init_state: kv self size  =    2.62 MB
>     whisper_init_state: kv cross size =    8.79 MB
>     whisper_init_state: loading Core ML model from 'models/ggml-tiny-encoder.mlmodelc'
>     whisper_init_state: first run on a device may take a while ...
>     whisper_init_state: Core ML model loaded
>     system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |
>     main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
>     [00:00:00.000 --> 00:00:10.500]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.
>     whisper_print_timings:     load time =    97.74 ms
>     whisper_print_timings:     fallbacks =   0 p /   0 h
>     whisper_print_timings:      mel time =    41.69 ms
>     whisper_print_timings:   sample time =    10.67 ms /    25 runs (    0.43 ms per run)
>     whisper_print_timings:   encode time =    33.32 ms /     1 runs (   33.32 ms per run)
>     whisper_print_timings:   decode time =    38.86 ms /    25 runs (    1.55 ms per run)
>     whisper_print_timings:    total time =  4314.09 ms
>
> Hope this helps someone.

Thanks a lot bro, it works! I was stuck. I love you.
