Description
System Info
transformers.js: 3.6.1
Environment/Platform
- Website/web-app
- Browser extension
- Server-side (e.g., Node.js, Deno, Bun)
- Desktop app (e.g., Electron)
- Other (e.g., VSCode extension)
Description
Using chunk_length_s=30 with onnx-community/whisper-base_timestamped produces broken timestamps.
Reproduction
Run the following code with the attached src.pcm (in the .zip) and check the output in the console log:
<script type="module">
  const { env, pipeline } = await import("https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.6.1/dist/transformers.min.js");
  env.allowLocalModels = false;

  const buffer = await (await fetch("src.pcm")).arrayBuffer();
  const audio = new Float32Array(buffer);

  const pipe = await pipeline("automatic-speech-recognition", "onnx-community/whisper-base_timestamped", {
    dtype: { encoder_model: "fp32", decoder_model_merged: "q4" },
    device: "webgpu",
  });

  const result = await pipe(audio, {
    chunk_length_s: 30,
    stride_length_s: 5,
    return_timestamps: "word",
    language: "en",
  });

  console.log(result.chunks.map(chunk => `${chunk.timestamp[0]} -> ${chunk.timestamp[1]}${chunk.text}`));
</script>
It prints:
"29.98 -> 29.98 every","29.98 -> 29.98 day","29.98 -> 29.98 style."
The timestamps are invalid (every word starts and ends at 29.98), and the audio contains far more speech than these three words.
Changing chunk_length_s to 29 fixes the issue and produces mostly valid output:
"0 -> 0.42 everyday","0.42 -> 0.86 style.","1.38 -> 1.56 - True","1.56 -> 2 classic","2 -> 2.5 delivers","2.5 -> 3.02 premium","3.02 -> 3.54 essentials","3.54 -> 3.84 built","3.84 -> 4.08 for","4.08 -> 4.42 real","4.42 -> 4.9 life.","5.4 -> 5.64 Grab","5.64 -> 6.04 yours","6.04 -> 6.36 at","6.36 -> 6.86 Target,","7.3 -> 7.78 Costco,","8.28 -> 8.3 or","8.3 -> 8.5 head","8.5 -> 8.7 to","8.7 -> 9.34 TrueClassic","9.34 -> 10.04 .com","10.04 -> 12.28 /p4p.","12.86 -> 13.08 Get","13.08 -> 13.38 hooked","13.38 -> 13.52 up","13.52 -> 13.94 today.","14.16 -> 14.24 Now","14.24 -> 14.46 before","14.46 -> 14.62 we","14.62 -> 14.82 go,","15.1 -> 15.24 just","15.24 -> 15.42 wanna","15.42 -> 15.56 give","15.56 -> 15.68 a","15.68 -> 15.86 big","15.86 -> 16.1 shout","16.1 -> 16.28 out","16.28 -> 16.76 to","16.76 -> 16.9 the","16.9 -> 17.52 CEO","17.52 -> 17.86 and","17.86 -> 18.32 founder,","18.48 -> 18.6 Ryan","18.6 -> 18.92 Frouder,","18.98 -> 19.06 for","19.06 -> 19.22 coming","19.22 -> 19.36 on","19.36 -> 19.5 our","19.5 -> 19.76 show","19.76 -> 20.4 and","20.4 -> 20.6 just","20.6 -> 20.86 showing","20.86 -> 21.08 some","21.08 -> 21.28 love.","21.46 -> 21.62 Now,","21.9 -> 22.36 let's","22.36 -> 22.46 get","22.46 -> 22.7 back","22.7 -> 23.06 to","23.06 -> 23.22 the","23.22 -> 23.54 episode","24.32 -> 24.44 I","24.44 -> 24.6 mean","24.6 -> 25.4 like","25.4 -> 25.52 I","25.52 -> 25.68 said","25.68 -> 26.12 we're","26.12 -> 26.34 going","26.34 -> 26.58 through","26.58 -> 26.82 that","26.82 -> 27.34 we're","27.34 -> 27.56 losing","27.56 -> 28.02 stars","28.02 -> 29.26 and","29.26 -> 29.38 then","29.38 -> 29.56 we","29.56 -> 29.84 kind","29.84 -> 29.98 of"
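For anyone comparing the two runs programmatically, the broken case can be spotted by checking for zero-length word spans. This is only a minimal sketch: it assumes chunks shaped like the ones logged above (a text plus a [start, end] timestamp pair), the sample arrays are copied from the two outputs in this report, and the helper name hasCollapsedTimestamps is hypothetical and not part of transformers.js.

```javascript
// Hypothetical helper, not part of transformers.js.
// Returns true when every chunk has a zero-length span (start === end),
// which is the pattern seen in the chunk_length_s=30 output above.
function hasCollapsedTimestamps(chunks) {
  if (!Array.isArray(chunks) || chunks.length === 0) return false;
  return chunks.every(({ timestamp }) => timestamp[0] === timestamp[1]);
}

// Sample data copied from the chunk_length_s=30 output.
const brokenChunks = [
  { text: "every", timestamp: [29.98, 29.98] },
  { text: "day", timestamp: [29.98, 29.98] },
  { text: "style.", timestamp: [29.98, 29.98] },
];

// First few entries copied from the chunk_length_s=29 output.
const validChunks = [
  { text: "everyday", timestamp: [0, 0.42] },
  { text: "style.", timestamp: [0.42, 0.86] },
  { text: "- True", timestamp: [1.38, 1.56] },
];

console.log(hasCollapsedTimestamps(brokenChunks)); // true
console.log(hasCollapsedTimestamps(validChunks)); // false
```

In the repro above this would be called as hasCollapsedTimestamps(result.chunks). It only flags the symptom and says nothing about why 30 triggers it.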
Why is 30 broken in this case? Is 29 safer in all cases, or is it just a coincidence?