whisper.cpp

Port of OpenAI's Whisper model in C/C++
Stable: v1.2.1 / Roadmap | F.A.Q.

High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model:
- Plain C/C++ implementation without dependencies
- Apple silicon first-class citizen - optimized via Arm Neon and Accelerate framework
- AVX intrinsics support for x86 architectures
- VSX intrinsics support for POWER architectures
- Mixed F16 / F32 precision
- Low memory usage (Flash Attention)
- Zero memory allocations at runtime
- Runs on the CPU
- C-style API
Supported platforms:
- macOS (Intel and Arm)
- iOS
- Android
- Linux / FreeBSD
- WebAssembly
- Windows (MSVC and MinGW)
- Raspberry Pi
The entire implementation of the model is contained in 2 source files:
- Tensor operations: ggml.h / ggml.c
- Transformer inference: whisper.h / whisper.cpp
Having such a lightweight implementation of the model makes it easy to integrate into different platforms and applications. As an example, here is a video of running the model on an iPhone 13 device - fully offline, on-device: whisper.objc
whisper-iphone-13-mini-2.mp4
You can also easily make your own offline voice assistant application: command
command-0.mp4
Or you can even run it straight in the browser: talk.wasm
- The core tensor operations are implemented in C (ggml.h / ggml.c)
- The transformer model and the high-level C-style API are implemented in C++ (whisper.h / whisper.cpp) - a minimal usage sketch follows this list
- Sample usage is demonstrated in main.cpp
- Sample real-time audio transcription from the microphone is demonstrated in stream.cpp
- Various other examples are available in the examples folder
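To make the C-style API concrete, here is a minimal sketch of the typical call sequence, modeled on the pattern used in main.cpp. It assumes `pcmf32` already holds 16 kHz mono float PCM in [-1, 1] (a WAV-loading sketch appears further below):

```c
// Minimal sketch of the whisper.h C API (see examples/main for the full version).
// Assumes `pcmf32` holds 16 kHz mono float PCM in [-1, 1].
#include <stdio.h>
#include "whisper.h"

int transcribe(const float * pcmf32, int n_samples) {
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en.bin");
    if (ctx == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.n_threads = 4;

    // run the full encoder/decoder pipeline over the audio buffer
    if (whisper_full(ctx, params, pcmf32, n_samples) != 0) {
        fprintf(stderr, "failed to process audio\n");
        whisper_free(ctx);
        return 1;
    }

    // print each decoded segment; t0/t1 are in units of 10 ms
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        printf("[%lld --> %lld] %s\n",
               (long long) whisper_full_get_segment_t0(ctx, i),
               (long long) whisper_full_get_segment_t1(ctx, i),
               whisper_full_get_segment_text(ctx, i));
    }

    whisper_free(ctx);
    return 0;
}
```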
The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, Arm Neon SIMD intrinsics or CBLAS Accelerate framework routines are used. The latter are especially effective for bigger sizes since the Accelerate framework utilizes the special-purpose AMX coprocessor available in modern Apple products.
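The actual dispatch logic lives in ggml.c, but the idea can be illustrated with a sketch like the following. This is not the real ggml code, and the `use_blas_threshold` cutoff is invented for the example; it only shows the dispatch-by-size strategy described above:

```c
// Illustrative sketch of size-based dispatch (NOT the actual ggml.c logic).
// Small matrix products use SIMD-friendly loops; big ones are handed to
// Accelerate's CBLAS, which can use the AMX coprocessor on Apple silicon.
#if defined(GGML_USE_ACCELERATE)
#include <Accelerate/Accelerate.h>
#endif

void mat_mul_f32(const float * a, const float * b, float * c,
                 int m, int n, int k) {
    const int use_blas_threshold = 32; // hypothetical cutoff, for the example only
#if defined(GGML_USE_ACCELERATE)
    if (m >= use_blas_threshold && n >= use_blas_threshold && k >= use_blas_threshold) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0f, a, k, b, n, 0.0f, c, n);
        return;
    }
#endif
    // fallback: plain loops that the compiler can auto-vectorize (NEON/AVX)
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int l = 0; l < k; ++l) {
                sum += a[i*k + l] * b[l*n + j];
            }
            c[i*n + j] = sum;
        }
    }
}
```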
First, download one of the Whisper models converted to ggml format. For example:

```bash
bash ./models/download-ggml-model.sh base.en
```
Now build the main example and transcribe an audio file like this:

```bash
# build the main example
make

# transcribe an audio file
./main -f samples/jfk.wav
```
For a quick demo, simply run make base.en:
```
$ make base.en

cc  -I.              -O3 -std=c11   -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate
./main -h

usage: ./main [options] file0.wav file1.wav ...

options:
  -h,        --help              [default] show this help message and exit
  -t N,      --threads N         [4      ] number of threads to use during computation
  -p N,      --processors N      [1      ] number of processors to use during computation
  -ot N,     --offset-t N        [0      ] time offset in milliseconds
  -on N,     --offset-n N        [0      ] segment index offset
  -d  N,     --duration N        [0      ] duration of audio to process in milliseconds
  -mc N,     --max-context N     [-1     ] maximum number of text context tokens to store
  -ml N,     --max-len N         [0      ] maximum segment length in characters
  -bo N,     --best-of N         [5      ] number of best candidates to keep
  -bs N,     --beam-size N       [-1     ] beam size for beam search
  -wt N,     --word-thold N      [0.01   ] word timestamp probability threshold
  -et N,     --entropy-thold N   [2.40   ] entropy threshold for decoder fail
  -lpt N,    --logprob-thold N   [-1.00  ] log probability threshold for decoder fail
  -su,       --speed-up          [false  ] speed up audio by x2 (reduced accuracy)
  -tr,       --translate         [false  ] translate from source language to english
  -di,       --diarize           [false  ] stereo audio diarization
  -nf,       --no-fallback       [false  ] do not use temperature fallback while decoding
  -otxt,     --output-txt        [false  ] output result in a text file
  -ovtt,     --output-vtt        [false  ] output result in a vtt file
  -osrt,     --output-srt        [false  ] output result in a srt file
  -owts,     --output-words      [false  ] output script for generating karaoke video
  -ocsv,     --output-csv        [false  ] output result in a CSV file
  -of FNAME, --output-file FNAME [       ] output file path (without file extension)
  -ps,       --print-special     [false  ] print special tokens
  -pc,       --print-colors      [false  ] print colors
  -pp,       --print-progress    [false  ] print progress
  -nt,       --no-timestamps     [true   ] do not print timestamps
  -l LANG,   --language LANG     [en     ] spoken language ('auto' for auto-detect)
             --prompt PROMPT     [       ] initial prompt
  -m FNAME,  --model FNAME       [models/ggml-base.en.bin] model path
  -f FNAME,  --file FNAME        [       ] input WAV file path

bash ./models/download-ggml-model.sh base.en
Downloading ggml model base.en ...
ggml-base.en.bin          100%[========================>] 141.11M  6.34MB/s   in 24s
Done! Model 'base.en' saved in 'models/ggml-base.en.bin'
You can now use it like this:

  $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav

===============================================
Running base.en on all samples in ./samples ...
===============================================

----------------------------------------------
[+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------

whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size  =   5.25 MB
whisper_model_load: kv cross size =  17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     = 140.60 MB
whisper_model_load: model size    = 140.54 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

whisper_print_timings: fallbacks   =   0 p /   0 h
whisper_print_timings: load time   =   113.81 ms
whisper_print_timings: mel time    =    15.40 ms
whisper_print_timings: sample time =    11.58 ms /    27 runs (    0.43 ms per run)
whisper_print_timings: encode time =   266.60 ms /     1 runs (  266.60 ms per run)
whisper_print_timings: decode time =    66.11 ms /    27 runs (    2.45 ms per run)
whisper_print_timings: total time  =   476.31 ms
```
The command downloads the base.en model converted to custom ggml format and runs the inference on all .wav samples in the folder samples.

For detailed usage instructions, run: ./main -h
Note that the main example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool. For example, you can use ffmpeg like this:

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```
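The examples load WAV audio with the bundled dr_wav library, but to make the 16-bit requirement concrete, here is a hedged sketch that converts 16-bit PCM into the float buffer whisper_full() expects. It assumes a canonical 44-byte mono WAV header instead of doing real RIFF chunk parsing:

```c
// Sketch: load a canonical 16-bit mono 16 kHz WAV into float PCM for whisper_full().
// Assumes a plain 44-byte RIFF header; real code (e.g. dr_wav, as used by the
// examples) should parse the chunks properly.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

float * load_wav_f32(const char * path, int * n_samples) {
    FILE * f = fopen(path, "rb");
    if (!f) return NULL;

    uint8_t header[44]; // canonical RIFF/WAVE header (assumed, not parsed)
    if (fread(header, 1, sizeof(header), f) != sizeof(header)) { fclose(f); return NULL; }

    fseek(f, 0, SEEK_END);
    const long n_bytes = ftell(f) - (long) sizeof(header);
    fseek(f, sizeof(header), SEEK_SET);

    const size_t n = (size_t) n_bytes / sizeof(int16_t);
    int16_t * pcm16  = malloc(n * sizeof(int16_t));
    float   * pcmf32 = malloc(n * sizeof(float));

    const size_t n_read = fread(pcm16, sizeof(int16_t), n, f);
    fclose(f);

    for (size_t i = 0; i < n_read; ++i) {
        pcmf32[i] = pcm16[i] / 32768.0f; // scale into [-1, 1]
    }

    free(pcm16);
    *n_samples = (int) n_read;
    return pcmf32;
}
```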
If you want some extra audio samples to play with, simply run:
```bash
make samples
```

This will download a few more audio files from Wikipedia and convert them to 16-bit WAV format via ffmpeg.
You can download and run the other models as follows:
```bash
make tiny.en
make tiny
make base.en
make base
make small.en
make small
make medium.en
make medium
make large-v1
make large
```

| Model | Disk | Mem | SHA |
|---|---|---|---|
| tiny | 75 MB | ~125 MB | bd577a113a864445d4c299885e0cb97d4ba92b5f |
| base | 142 MB | ~210 MB | 465707469ff3a37a2b9b8d8f89f2f99de7299dac |
| small | 466 MB | ~600 MB | 55356645c2b361a969dfd0ef2c5a50d530afd8d5 |
| medium | 1.5 GB | ~1.7 GB | fd9727b6e1217c2f614f9b698455c4ffd82463b4 |
| large | 2.9 GB | ~3.3 GB | 0f4c8e34f21cf1a914c59d8b3ce882345ad349d6 |
Limitations:
- Inference only
- No GPU support (yet)
Here is another example of transcribing a 3:24 min speech in about half a minute on a MacBook M1 Pro, using the medium.en model:
```
$ ./main -m models/ggml-medium.en.bin -f samples/gb1.wav -t 8

whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem required  = 1720.00 MB (+ 43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     = 1462.35 MB
whisper_model_load: model size    = 1462.12 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

main: processing 'samples/gb1.wav' (3179750 samples, 198.7 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:08.000]   My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:08.000 --> 00:00:17.000]   At nine o'clock this morning, Mission Control in Houston lost contact with our Space Shuttle Columbia.
[00:00:17.000 --> 00:00:23.000]   A short time later, debris was seen falling from the skies above Texas.
[00:00:23.000 --> 00:00:29.000]   The Columbia's lost. There are no survivors.
[00:00:29.000 --> 00:00:32.000]   On board was a crew of seven.
[00:00:32.000 --> 00:00:39.000]   Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark,
[00:00:39.000 --> 00:00:48.000]   Captain David Brown, Commander William McCool, Dr. Kultna Shavla, and Ilan Ramon,
[00:00:48.000 --> 00:00:52.000]   a colonel in the Israeli Air Force.
[00:00:52.000 --> 00:00:58.000]   These men and women assumed great risk in the service to all humanity.
[00:00:58.000 --> 00:01:03.000]   In an age when space flight has come to seem almost routine,
[00:01:03.000 --> 00:01:07.000]   it is easy to overlook the dangers of travel by rocket
[00:01:07.000 --> 00:01:12.000]   and the difficulties of navigating the fierce outer atmosphere of the Earth.
[00:01:12.000 --> 00:01:18.000]   These astronauts knew the dangers, and they faced them willingly,
[00:01:18.000 --> 00:01:23.000]   knowing they had a high and noble purpose in life.
[00:01:23.000 --> 00:01:31.000]   Because of their courage and daring and idealism, we will miss them all the more.
[00:01:31.000 --> 00:01:36.000]   All Americans today are thinking as well of the families of these men and women
[00:01:36.000 --> 00:01:40.000]   who have been given this sudden shock and grief.
[00:01:40.000 --> 00:01:45.000]   You're not alone. Our entire nation grieves with you,
[00:01:45.000 --> 00:01:52.000]   and those you love will always have the respect and gratitude of this country.
[00:01:52.000 --> 00:01:56.000]   The cause in which they died will continue.
[00:01:56.000 --> 00:02:04.000]   Mankind is led into the darkness beyond our world by the inspiration of discovery
[00:02:04.000 --> 00:02:11.000]   and the longing to understand. Our journey into space will go on.
[00:02:11.000 --> 00:02:16.000]   In the skies today, we saw destruction and tragedy.
[00:02:16.000 --> 00:02:22.000]   Yet farther than we can see, there is comfort and hope.
[00:02:22.000 --> 00:02:29.000]   In the words of the prophet Isaiah, "Lift your eyes and look to the heavens
[00:02:29.000 --> 00:02:35.000]   who created all these. He who brings out the starry hosts one by one
[00:02:35.000 --> 00:02:39.000]   and calls them each by name."
[00:02:39.000 --> 00:02:46.000]   Because of His great power and mighty strength, not one of them is missing.
[00:02:46.000 --> 00:02:55.000]   The same Creator who names the stars also knows the names of the seven souls we mourn today.
[00:02:55.000 --> 00:03:01.000]   The crew of the shuttle Columbia did not return safely to earth,
[00:03:01.000 --> 00:03:05.000]   yet we can pray that all are safely home.
[00:03:05.000 --> 00:03:13.000]   May God bless the grieving families, and may God continue to bless America.
[00:03:13.000 --> 00:03:19.000]   [Silence]

whisper_print_timings: fallbacks   =   1 p /   0 h
whisper_print_timings: load time   =   569.03 ms
whisper_print_timings: mel time    =   146.85 ms
whisper_print_timings: sample time =   238.66 ms /   553 runs (    0.43 ms per run)
whisper_print_timings: encode time = 18665.10 ms /     9 runs ( 2073.90 ms per run)
whisper_print_timings: decode time = 13090.93 ms /   549 runs (   23.85 ms per run)
whisper_print_timings: total time  = 32733.52 ms
```
This is a naive example of performing real-time inference on audio from your microphone. The stream tool samples the audio every half a second and runs the transcription continuously. More info is available in issue #10.
```bash
make stream
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
```
rt_esl_csgo_2.mp4
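Conceptually, the stream tool keeps a sliding window over the microphone feed. The sketch below shows the shape of that loop; audio_capture_read() is a hypothetical stand-in for the audio backend (the real example uses SDL2 for capture), and the step/length values mirror the command line above:

```c
// Sliding-window streaming sketch. audio_capture_read() is a hypothetical
// stand-in for a real capture backend (examples/stream uses SDL2).
#include <stdio.h>
#include <string.h>
#include "whisper.h"

#define STEP_SAMPLES (WHISPER_SAMPLE_RATE / 2) // 500 ms step
#define LEN_SAMPLES  (WHISPER_SAMPLE_RATE * 5) // 5 s window

// hypothetical: blocks until n new mono f32 samples are available
extern int audio_capture_read(float * dst, int n);

void stream_loop(struct whisper_context * ctx) {
    static float window[LEN_SAMPLES];

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.no_context     = true; // treat each window independently
    params.single_segment = true; // one segment per window

    for (;;) {
        // drop the oldest step and append fresh samples at the end
        memmove(window, window + STEP_SAMPLES,
                (LEN_SAMPLES - STEP_SAMPLES) * sizeof(float));
        if (audio_capture_read(window + LEN_SAMPLES - STEP_SAMPLES, STEP_SAMPLES) != 0) {
            break; // capture ended
        }

        if (whisper_full(ctx, params, window, LEN_SAMPLES) == 0) {
            for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
                printf("%s\n", whisper_full_get_segment_text(ctx, i));
            }
        }
    }
}
```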
Adding the --print-colors argument will print the transcribed text using an experimental color coding strategy to highlight words with high or low confidence.
The length of the generated text segments can also be controlled (this feature is experimental). For example, to limit the line length to a maximum of 16 characters, simply add -ml 16:
```
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16

whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:00.850]   And so my
[00:00:00.850 --> 00:00:01.590]   fellow
[00:00:01.590 --> 00:00:04.140]   Americans, ask
[00:00:04.140 --> 00:00:05.660]   not what your
[00:00:05.660 --> 00:00:06.840]   country can do
[00:00:06.840 --> 00:00:08.430]   for you, ask
[00:00:08.430 --> 00:00:09.440]   what you can do
[00:00:09.440 --> 00:00:10.020]   for your
[00:00:10.020 --> 00:00:11.000]   country.
```
The --max-len argument can be used to obtain word-level timestamps. Simply use -ml 1:
```
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1

whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:00.320]
[00:00:00.320 --> 00:00:00.370]   And
[00:00:00.370 --> 00:00:00.690]   so
[00:00:00.690 --> 00:00:00.850]   my
[00:00:00.850 --> 00:00:01.590]   fellow
[00:00:01.590 --> 00:00:02.850]   Americans
[00:00:02.850 --> 00:00:03.300]  ,
[00:00:03.300 --> 00:00:04.140]   ask
[00:00:04.140 --> 00:00:04.990]   not
[00:00:04.990 --> 00:00:05.410]   what
[00:00:05.410 --> 00:00:05.660]   your
[00:00:05.660 --> 00:00:06.260]   country
[00:00:06.260 --> 00:00:06.600]   can
[00:00:06.600 --> 00:00:06.840]   do
[00:00:06.840 --> 00:00:07.010]   for
[00:00:07.010 --> 00:00:08.170]   you
[00:00:08.170 --> 00:00:08.190]  ,
[00:00:08.190 --> 00:00:08.430]   ask
[00:00:08.430 --> 00:00:08.910]   what
[00:00:08.910 --> 00:00:09.040]   you
[00:00:09.040 --> 00:00:09.320]   can
[00:00:09.320 --> 00:00:09.440]   do
[00:00:09.440 --> 00:00:09.760]   for
[00:00:09.760 --> 00:00:10.020]   your
[00:00:10.020 --> 00:00:10.510]   country
[00:00:10.510 --> 00:00:11.000]  .
```
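The same word-level output can be obtained programmatically. A hedged sketch, assuming the experimental token_timestamps flag behaves as it does in the main example:

```c
// Sketch: word-level timestamps via the C API.
// token_timestamps is experimental; this mirrors what `-ml 1` does in main.
#include <stdio.h>
#include "whisper.h"

void print_word_timestamps(struct whisper_context * ctx,
                           const float * pcmf32, int n_samples) {
    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.token_timestamps = true; // compute per-token timestamps
    params.max_len          = 1;    // force one word per segment

    if (whisper_full(ctx, params, pcmf32, n_samples) != 0) return;

    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        // t0/t1 are in units of 10 ms
        printf("[%lld --> %lld] %s\n",
               (long long) whisper_full_get_segment_t0(ctx, i),
               (long long) whisper_full_get_segment_t1(ctx, i),
               whisper_full_get_segment_text(ctx, i));
    }
}
```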
The main example supports output of karaoke-style movies, where the currently pronounced word is highlighted. Use the -wts argument and run the generated bash script. This requires ffmpeg to be installed.

Here are a few "typical" examples:
```bash
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
source ./samples/jfk.wav.wts
ffplay ./samples/jfk.wav.mp4
```
jfk.wav.mp4
```bash
./main -m ./models/ggml-base.en.bin -f ./samples/mm0.wav -owts
source ./samples/mm0.wav.wts
ffplay ./samples/mm0.wav.mp4
```
mm0.wav.mp4
```bash
./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav -owts
source ./samples/gb0.wav.wts
ffplay ./samples/gb0.wav.mp4
```
gb0.wav.mp4
Use the extra/bench-wts.sh script to generate a video in the following format:

```bash
./extra/bench-wts.sh samples/jfk.wav
ffplay ./samples/jfk.wav.all.mp4
```
jfk.wav.all.mp4
In order to have an objective comparison of the performance of the inference across different system configurations, use the bench tool. The tool simply runs the Encoder part of the model and prints how much time it took to execute it. The results are summarized in a GitHub issue.
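In terms of the public API, what bench measures boils down to a single Encoder pass. A hedged approximation using only calls from whisper.h; the 30 seconds of silence stands in for real audio, since the encoder always processes 30-second windows:

```c
// Hedged sketch of what the bench tool measures: one Encoder pass.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "whisper.h"

int bench_encoder(const char * model_path, int n_threads) {
    struct whisper_context * ctx = whisper_init_from_file(model_path);
    if (!ctx) return 1;

    // 30 s of silence -> mel spectrogram (the encoder always sees 30 s windows)
    const int n = 30 * WHISPER_SAMPLE_RATE;
    float * pcm = calloc(n, sizeof(float));
    whisper_pcm_to_mel(ctx, pcm, n, n_threads);

    // rough CPU-time measurement; the actual tool uses wall-clock timers
    const clock_t t0 = clock();
    whisper_encode(ctx, 0, n_threads); // single Encoder pass at offset 0
    const clock_t t1 = clock();

    printf("encode time = %.2f ms\n", 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC);

    free(pcm);
    whisper_free(ctx);
    return 0;
}
```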
The original models are converted to a custom binary format. This allows packing everything needed into a single file:
- model parameters
- mel filters
- vocabulary
- weights
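For a feel of the layout, here is a hedged sketch that reads just the leading magic and hyperparameters of a converted file. The field order follows the whisper_model_load log shown earlier; treat it as illustrative rather than a format specification:

```c
// Sketch: peek at the header of a converted ggml Whisper model file.
// Field order follows the whisper_model_load output above - illustrative only.
#include <stdint.h>
#include <stdio.h>

int read_ggml_header(const char * path) {
    FILE * f = fopen(path, "rb");
    if (!f) return 1;

    uint32_t magic = 0;
    if (fread(&magic, sizeof(magic), 1, f) != 1 || magic != 0x67676d6c) { // "ggml"
        fprintf(stderr, "%s: not a ggml model file\n", path);
        fclose(f);
        return 1;
    }

    // n_vocab, n_audio_ctx, n_audio_state, n_audio_head, n_audio_layer,
    // n_text_ctx, n_text_state, n_text_head, n_text_layer, n_mels, f16
    int32_t hparams[11];
    if (fread(hparams, sizeof(int32_t), 11, f) != 11) {
        fclose(f);
        return 1;
    }

    printf("n_vocab = %d, n_mels = %d, f16 = %d\n", hparams[0], hparams[9], hparams[10]);

    fclose(f);
    return 0;
}
```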
You can download the converted models using the models/download-ggml-model.sh script, or manually.

For more details, see the conversion script models/convert-pt-to-ggml.py or the README in models.
Bindings:
- Rust: tazz4843/whisper-rs | #310
- JavaScript: bindings/javascript
- Python: stlukey/whispercpp.py (Cython)
- Python: aarnphm/whispercpp (Pybind11)
There are various examples of using the library for different projects in the examples folder. Some of the examples are even ported to run in the browser using WebAssembly. Check them out!
| Example | Web | Description |
|---|---|---|
| main | whisper.wasm | Tool for translating and transcribing audio using Whisper |
| bench | bench.wasm | Benchmark the performance of Whisper on your machine |
| stream | stream.wasm | Real-time transcription of raw microphone capture |
| command | command.wasm | Basic voice assistant example for receiving voice commands from the mic |
| talk | talk.wasm | Talk with a GPT-2 bot |
| whisper.objc | | iOS mobile application using whisper.cpp |
| whisper.swiftui | | SwiftUI iOS / macOS application using whisper.cpp |
| whisper.android | | Android mobile application using whisper.cpp |
| whisper.nvim | | Speech-to-text plugin for Neovim |
| generate-karaoke.sh | | Helper script to easily generate a karaoke video of raw audio capture |
| livestream.sh | | Livestream audio transcription |
| yt-wsp.sh | | Download + transcribe and/or translate any VOD (original) |
If you have any kind of feedback about this project, feel free to use the Discussions section and open a new topic. You can use the Show and tell category to share your own projects that use whisper.cpp. If you have a question, make sure to check the Frequently asked questions (#126) discussion.