whisper : support speaker segmentation (local diarization) of mono audio via tinydiarize #1058


Merged
ggerganov merged 17 commits into ggml-org:master from akashmjn:tdrz-integrate-1 on Jul 4, 2023

Conversation

@akashmjn (Contributor) commented on Jun 27, 2023 (edited by ggerganov)

As discussed in #64, this PR adds experimental support for local diarization (marking of speaker turns) via integration of checkpoints from this project: https://github.com/akashmjn/tinydiarize/tree/main.

This is an early functional prototype done for the small.en models.

@ggerganov - this should be functionally done save for the last two points on the checklist, for which I'd appreciate some comments on the right way to expose this.

(also please excuse my C++, I haven't written a lot of it, so this is heavily copilot-assisted 😉)

[screenshot from 2023-05-27]

Example usage

make
./models/download-ggml-model.sh small.en-tdrz
make samples
./main -m models/ggml-small.en-tdrz.bin -f samples/a13.wav

After running the above, you should see this:

[screenshot from 2023-06-20 showing the diarized transcript]

The JSON output contains an extra speaker_turn_next field for each segment with this information.

Example JSON output
{"systeminfo": "AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | COREML = 0 | ","model": {"type": "small","multilingual": false,"vocab": 51864,"audio": {"ctx": 1500,"state": 768,"head": 12,"layer": 12},"text": {"ctx": 448,"state": 768,"head": 12,"layer": 12},"mels": 80,"ftype": 1},"params": {"model": "models/whisper-small.en.tdrz/ggml-small.en-tdrz.bin","language": "en","translate": false},"result": {"language": "en"},"transcription": [{"timestamps": {"from": "00:00:00,000","to": "00:00:03,800"},"offsets": {"from": 0,"to": 3800},"text": " Okay Houston, we've had a problem here. [SPEAKER TURN]""speaker_turn_next": true},                ...]}

Checklist:

Some terminology context for the last two points: this is technically not complete diarization yet, but speaker segmentation https://www.perplexity.ai/search/d01e6743-d2dc-4f5e-b5c2-2bf2212068f7?s=u (which can be thought of as local diarization).
Also, technically the stereo audio input used by the current --diarize flag is already diarized (as it is separated into individual channels), so the naming isn't strictly consistent here either?

@akashmjn changed the title from "whisper: support speaker segmentation (local diarization) of mono audio via integration of tinydiarize" to "whisper: support speaker segmentation (local diarization) of mono audio via tinydiarize" on Jun 27, 2023
@JianbangZ

Does this support multiple languages or just English?

@skye-repos

Excited! Will this support multiple speaker labelling or will it just mark speaker turns?

@akashmjn (Contributor, Author) commented on Jun 30, 2023 (edited)

Hi @Harith163 and @JianbangZ:

  • at the moment, just speaker turns and no clustering
  • this PR is merging a PoC done for the small.en models, so English-only

Both of these are doable I think, but they are a little more involved and honestly depend on how the project evolves.

For multilingual - I think it's easiest done by OpenAI themselves, since ultimately that boils down to a reasonably multilingual finetuning dataset, and I'm pretty sure all released Whisper models had a final finetuning stage.

I'd say clustering has fewer dependencies and is a bit more tractable. I will sketch a rough plan for that once a few immediate things are done.

You can take a look at the immediate roadmap over at https://github.com/akashmjn/tinydiarize/tree/main#roadmap.

@akashmjn (Contributor, Author)

In fact @ggerganov I notice that you've already implemented C-means by hand in C++ here: #130 😅. Once I free up a little, I'll try running some clustering experiments over on the Python repo.

In the meantime, if you are interested, this is the best method out there: NME-SC.

@ggerganov (Member)

Yes :) Felt like doing some experiments (I cannot guarantee correctness of that implementation)

Btw, will be reviewing the PR over the weekend. Adding a diarization flag should be easy


@akashmjn (Contributor, Author) commented on Jul 2, 2023 (edited)

> Yes :) Felt like doing some experiments (I cannot guarantee correctness of that implementation)
>
> Btw, will be reviewing the PR over the weekend. Adding a diarization flag should be easy

Sounds good! For the last two points on my checklist, for now I'll wait for your review. I've left // TODO @Akash at the places where the behaviour needs to be toggled. If you find it more efficient, feel free to directly modify the PR however you think is best to expose this feature.

I think it should just be clear to the user that this is an experimental feature and requires using a specific *.tdrz checkpoint.

@ggerganov (Member)

I synced the latest ggml from llama.cpp and tomorrow will add the config option for tinydiarize and merge


@ohmguru

Excited to see this PR merged. Noticed that this PR doesn't yet support the word-level timestamp flag. I wanted to flag that for consideration, as word-level timestamps are quite helpful when building applications that show diarization output.


@ggerganov (Member)

@akashmjn

This should be ready to merge now. Please take a look at my changes and let me know if you agree.
For now, let's leave the stereo "diarize" flag as it is - will rename it later to reflect what it actually does.

The most important change is that I added token_tdrz and kept token_solm as it is.

Also, you now have to add the -tdrz flag to explicitly enable speaker turn detection even when using tinydiarize models.
The flag should not do anything if the model used is not a tinydiarize one.

$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -tdrz

main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...

[00:00:00.000 --> 00:00:03.800]   Okay Houston, we've had a problem here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:06.200]   This is Houston. Say again please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:08.260]   Uh Houston we've had a problem.
[00:00:08.260 --> 00:00:11.320]   We've had a main beam up on a volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:13.820]   Roger main beam interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.100]   Uh uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:18.020]   So okay stand, by thirteen we're looking at it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:25.740]   Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940]   And we had a a pretty large bank or so.

Here is without it:

$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin

main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:03.760]   Okay Houston, we've had a problem here.
[00:00:03.760 --> 00:00:08.340]   Uh Houston we've had a problem.
[00:00:08.340 --> 00:00:11.320]   We've had a main beam up on a volt.
[00:00:11.320 --> 00:00:13.760]   Roger main beam interval.
[00:00:13.760 --> 00:00:17.960]   So okay stand, by thirteen we're looking at it.
[00:00:17.960 --> 00:00:25.740]   Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940]   And we had a a pretty large bank or so.

Here is word-level timestamps with speaker turn detection:

$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -ml 1 -sow -tdrz

main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...

[00:00:00.000 --> 00:00:00.060]
[00:00:00.060 --> 00:00:00.500]   Okay
[00:00:00.500 --> 00:00:01.340]   Houston,
[00:00:01.340 --> 00:00:01.850]   we've
[00:00:01.850 --> 00:00:02.160]   had
[00:00:02.160 --> 00:00:02.260]   a
[00:00:02.260 --> 00:00:02.990]   problem
[00:00:02.990 --> 00:00:03.800]   here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:04.030]   This
[00:00:04.030 --> 00:00:04.140]   is
[00:00:04.140 --> 00:00:04.710]   Houston.
[00:00:04.710 --> 00:00:04.880]   Say
[00:00:04.880 --> 00:00:05.170]   again
[00:00:05.170 --> 00:00:06.200]   please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:06.340]   Uh
[00:00:06.340 --> 00:00:06.850]   Houston
[00:00:06.850 --> 00:00:07.210]   we've
[00:00:07.210 --> 00:00:07.430]   had
[00:00:07.430 --> 00:00:07.530]   a
[00:00:07.530 --> 00:00:08.260]   problem.
[00:00:08.260 --> 00:00:08.770]   We've
[00:00:08.770 --> 00:00:09.080]   had
[00:00:09.080 --> 00:00:09.180]   a
[00:00:09.180 --> 00:00:09.610]   main
[00:00:09.610 --> 00:00:10.000]   beam
[00:00:10.000 --> 00:00:10.200]   up
[00:00:10.200 --> 00:00:10.400]   on
[00:00:10.400 --> 00:00:10.500]   a
[00:00:10.500 --> 00:00:11.320]   volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:11.840]   Roger
[00:00:11.840 --> 00:00:12.250]   main
[00:00:12.250 --> 00:00:12.740]   beam
[00:00:12.740 --> 00:00:13.820]   interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.080]   Uh
[00:00:15.080 --> 00:00:15.100]   uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:15.230]   So
[00:00:15.230 --> 00:00:15.500]   okay
[00:00:15.500 --> 00:00:15.970]   stand,
[00:00:15.970 --> 00:00:16.100]   by
[00:00:16.100 --> 00:00:16.660]   thirteen
[00:00:16.660 --> 00:00:16.980]   we're
[00:00:16.980 --> 00:00:17.460]   looking
[00:00:17.460 --> 00:00:17.610]   at
[00:00:17.610 --> 00:00:18.020]   it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:18.570]   Okay
[00:00:18.570 --> 00:00:18.840]   uh
[00:00:18.840 --> 00:00:19.530]   right
[00:00:19.530 --> 00:00:19.940]   now
[00:00:19.940 --> 00:00:20.210]   uh
[00:00:20.210 --> 00:00:21.170]   Houston
[00:00:21.170 --> 00:00:21.580]   the
[00:00:21.580 --> 00:00:21.850]   uh
[00:00:21.850 --> 00:00:22.810]   voltage
[00:00:22.810 --> 00:00:23.080]   is
[00:00:23.080 --> 00:00:23.400]   uh
[00:00:23.400 --> 00:00:23.730]   is
[00:00:23.730 --> 00:00:24.810]   looking
[00:00:24.810 --> 00:00:25.440]   good
[00:00:25.440 --> 00:00:25.740]   um.
[00:00:27.620 --> 00:00:27.670]
[00:00:27.670 --> 00:00:27.840]   And
[00:00:27.840 --> 00:00:27.980]   we
[00:00:27.980 --> 00:00:28.210]   had
[00:00:28.210 --> 00:00:28.270]   a
[00:00:28.270 --> 00:00:28.340]   a
[00:00:28.340 --> 00:00:28.780]   pretty
[00:00:28.780 --> 00:00:29.150]   large
[00:00:29.150 --> 00:00:29.440]   bank
[00:00:29.440 --> 00:00:29.580]   or
[00:00:29.580 --> 00:00:29.940]   so.

@akashmjn (Contributor, Author) left a comment:


Added some comments relating to some tricky token ID stuff

JEF1056 added commits to JEF1056/whisper.rn that referenced this pull request on Sep 29, 2023
@tingyuchang

@karolszafranski I think there's no need for any special settings; set tdrz_enable to true and you can get the data from whisper_full_get_segment_speaker_turn_next for each segment.
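For illustration, a minimal sketch of that flow against the whisper.cpp C API (not taken from this thread): it assumes pcmf32 already holds 16 kHz mono float samples and a build where whisper_init_from_file_with_params is available, and it relies on the tdrz_enable parameter and whisper_full_get_segment_speaker_turn_next getter mentioned above.

```cpp
// Sketch: enable tinydiarize and print per-segment speaker-turn flags.
// pcmf32 is assumed to contain 16 kHz mono float PCM (loaded elsewhere).
#include <cstdio>
#include <vector>

#include "whisper.h"

void transcribe_with_turns(const char * model_path, const std::vector<float> & pcmf32) {
    struct whisper_context * ctx =
        whisper_init_from_file_with_params(model_path, whisper_context_default_params());
    if (ctx == nullptr) {
        fprintf(stderr, "failed to load model: %s\n", model_path);
        return;
    }

    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.tdrz_enable = true; // should be a no-op unless a *.tdrz checkpoint is loaded

    if (whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size()) == 0) {
        const int n_segments = whisper_full_n_segments(ctx);
        for (int i = 0; i < n_segments; ++i) {
            printf("%s%s\n",
                   whisper_full_get_segment_text(ctx, i),
                   whisper_full_get_segment_speaker_turn_next(ctx, i) ? " [SPEAKER_TURN]" : "");
        }
    }

    whisper_free(ctx);
}
```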

jacobwu-b pushed commits to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request on Oct 24, 2023:

…dio via tinydiarize (ggml-org#1058)

* add HuggingFace mirror to download ggml model
* support tdrz via simple hack overriding solm tokens
* fix incorrect translate/transcribe token_ids that are not static const
* add apollo 13 sample for tdrz demo
* render [SPEAKER TURN] consistently in all terminal output using vocab.id_to_token
* extend whisper_segment with speaker_turn_next field and save in json output
* fix failing go build
* slipped in some python syntax whoops
* whisper : finalize tinydiarize support (add flag + fixes)
* whisper : tdrz support for word-level timestamps (respect max_len)
* java : try to fix tests after adding tdrz_enable flag
* main : remove TODO leftover
* java : fix params order list after adding "tdrz_enable"
* whisper : fix solm and add nosp token
* main : print tinydiarize help

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@khimaros

I'm not sure if this is expected, but with medium.en-q5_0, I'm seeing that speaker turns are pretty reliably marked with >>. I'm not using the --diarize or --tdrz flags.

I wasn't seeing this behavior with large-v2, large-v3, or large-v3-q5_0. Any thoughts on why that would be happening?


landtanin pushed commits to landtanin/whisper.cpp that referenced this pull request on Dec 16, 2023
@rben01

Is there a way to use this with Core ML models?

whisper_init_from_file_with_params_no_state: loading model from './models/ggml-small.en-tdrz.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3 (small)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   464.64 MiB, (  466.27 / 10922.67)
whisper_model_load:    Metal buffer size =   487.20 MB
whisper_model_load: model size    =  487.00 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    47.25 MiB, (  513.52 / 10922.67)
whisper_init_state: kv self size  =   49.55 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    52.73 MiB, (  566.25 / 10922.67)
whisper_init_state: kv cross size =   55.30 MB
whisper_init_state: loading Core ML model from './models/ggml-small.en-tdrz-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: failed to load Core ML model from './models/ggml-small.en-tdrz-encoder.mlmodelc'
ggml_metal_free: deallocating
error: failed to initialize whisper context

@zkvsky commented on Feb 9, 2024 (edited)

> I'm not sure if this is expected, but with medium.en-q5_0, I'm seeing that speaker turns are pretty reliably marked with >>. I'm not using the --diarize or --tdrz flags.
>
> I wasn't seeing this behavior with large-v2, large-v3, or large-v3-q5_0. Any thoughts on why that would be happening?

It also happens with the small model, on its own or when pushed via a ">>" prompt. Unfortunately, for the life of me I cannot combine it with my other prompt, which resulted in proper quote-unquote behavior, i.e.

Knock on the door and I had to be like, "Oh my God, please, is there anybody in there?" And she was like, "Okay, let's see how this goes"

And quotes only happen when using -oved GPU [unfortunately it hallucinates a lot], whereas -oved CPU is much more likely to trigger ">>" diarizations on its own.
This is so weird...

@kuro337

Hello!

I was wondering - how does the integration work with the ./server?

Because when I ran the same file through the binary and through the server, it seemed the diarization output was missing from the server.

Example:

./main -f ../audio/multi.wav -m ./models/ggml-small.en-tdrz.bin -tdrz --print-colors

# output
[00:00:00.080 --> 00:00:04.820]   Let's go down. So your sister's going off How. old is she? [SPEAKER_TURN]
[00:00:04.820 --> 00:00:08.620]   She's twenty five. [SPEAKER_TURN]
[00:00:08.620 --> 00:00:12.560]   Alright. And is she going to go to do a job or is she's gonna travel? [SPEAKER_TURN]
[00:00:12.560 --> 00:00:19.940]   Um she's going to work when she's there and do like bits of jobs and then move around at the same time. [SPEAKER_TURN]
[00:00:19.940 --> 00:00:22.520]   So is she's goin

Same example using the server

./server -m models/ggml-small.en-tdrz.bin -tdrz -pc -debug

curl 127.0.0.1:8080/inference \
  -H "Content-Type: multipart/form-data" \
  -F file="@../audio/multi.wav" \
  -F response_format="json" \
  -F tinydiarize=true

Output

{"text":" Mm. Okay.\n So your sister's going off How. old is she?\n She's twenty five.\n Alright. And is she going to go to do a job or is she's gonna travel?\n Um she's going to work when she's there and do like bits of jobs and then move around at the same time.\n So she's going straight to Australia?\n Um no first she's going to Thailand.\n And then she's going to Australia.\n And then move somewhere and then in America.\n Brilliant. So if she's bought one of these year t tickets you can go around the world f in a year or something Is. that what she's done with these airline tickets yeah, Yeah? So would you like to travel?\n Yeah.\n Mm-hmm.\n That's a good a reason though. Yeah. Actually I think it probably is because I mean I know it sounds straight forward but you can sort of add E_s and A_s and things on the end of things and it normally sounds right anyway. We've got a Spanish girl working with us at the moment so, So this is a a two year course now is, it G_C_S_E_s?\n Yeah. It's from year ten to year E_ el" }

Tried all of these formats and got the same result:

json | text | srt | verbose_json | vtt

No issues if it's not supported - I was just wondering if it was possible, because the docs mention we can pass -tdrz to the server, so I was wondering if I was doing anything wrong!

Cheers

@shoryamalani

Hey guys, is this functionality coming to the larger models or could we compile it ourselves?

Thank you so much


iThalay pushed commits to iThalay/whisper.cpp that referenced this pull request on Sep 23, 2024
lyapple2008 pushed a commit to lyapple2008/whisper.cpp.mars that referenced this pull request on Feb 4, 2025
Reviewers

@ggerganov approved these changes


13 participants

@akashmjn, @JianbangZ, @skye-repos, @ggerganov, @ohmguru, @gotjoshua, @karolszafranski, @tingyuchang, @khimaros, @rben01, @zkvsky, @kuro337, @shoryamalani
