k2-fsa/sherpa


Can you tell me what is the fastest solution for whisper #703

Unanswered

sh1man999 asked this question in Q&A

sh1man999

Jan 21, 2025


Can you tell me what the fastest solution for Whisper is? Are there any models that do not have the 30-second audio limitation?

https://github.com/k2-fsa/icefall/blob/master/egs/aishell/ASR/whisper/whisper_encoder_forward_monkey_patch.py#L15


Replies: 1 comment 12 replies


csukuangfj
Jan 22, 2025
Maintainer

Do you want to recognize audios longer or shorter than 30s?


12 replies


sh1man999 Jan 24, 2025
Author

You cannot find a non-streaming model anywhere accepting audio files of any length.

Why not? You say:

Using Fine-tuned Whisper

Official Whisper models only accept 30-second audio. To improve the throughput, you could fine-tune the Whisper model to remove the 30-second restriction. See examples.

We prepared two Chinese fine-tuned Whisper TensorRT-LLM weight repos. They can be used directly from here.


csukuangfj Jan 24, 2025
Maintainer

Theoretically, if your RAM is infinite, then it is possible.

Removing the 30s constraint means you can input audio files shorter than 30s without padding. It does not mean you can input files of arbitrary length.

By the way, is there any disadvantage of using a vad model here?
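
For recordings longer than 30 s, the VAD approach mentioned above can be sketched roughly as follows. This is only an illustration, not sherpa's implementation: it assumes some VAD has already produced a list of (start, end) speech segments in seconds, and the helper name `pack_segments` is hypothetical. It greedily merges neighbouring segments and splits over-long ones so that every chunk passed to Whisper spans at most 30 s.

```python
def pack_segments(segments, max_len=30.0):
    """Group VAD speech segments into chunks spanning at most max_len seconds.

    segments: list of (start, end) times in seconds, sorted by start.
    Hypothetical helper for illustration; not part of sherpa.
    """
    chunks = []
    for start, end in segments:
        # A single segment longer than max_len is cut into max_len pieces.
        while end - start > max_len:
            chunks.append((start, start + max_len))
            start += max_len
        # Extend the previous chunk if the combined span still fits
        # (the silence between the two segments is kept inside the chunk).
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


segments = [(0.0, 4.0), (5.0, 12.0), (20.0, 33.0), (40.0, 75.0)]
print(pack_segments(segments))
# [(0.0, 12.0), (20.0, 33.0), (40.0, 70.0), (70.0, 75.0)]
```

Each resulting chunk can then be decoded independently, which keeps memory use bounded regardless of the total file length.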


sh1man999 Jan 24, 2025
Author

Theoretically, if your RAM is infinite, then it is possible.
Removing the 30s constraint means you can input audio files shorter than 30s without padding. It does not mean you can input files of arbitrary length.
By the way, is there any disadvantage of using a vad model here?

Okay, I understand that long files aren't possible.
I haven't noticed any flaws in the VAD model, but I haven't used VAD enough to assess it properly.

How do I input audio files shorter than 30 seconds without padding?


csukuangfj Jan 24, 2025
Maintainer

Padding is invisible to users if the input audio is shorter than 30s. It is an implementation detail.
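
As a rough picture of what that hidden padding step amounts to, here is a minimal NumPy sketch. It assumes 16 kHz mono audio (the sample rate Whisper works with) and zero padding at the end; the function name `pad_to_30s` is made up for this example and the actual code path inside sherpa may differ.

```python
import numpy as np

SAMPLE_RATE = 16000            # assumed: Whisper operates on 16 kHz audio
N_SAMPLES = 30 * SAMPLE_RATE   # 480000 samples in a 30 s window


def pad_to_30s(audio):
    """Zero-pad a 1-D waveform at the end so it is exactly 30 s long.

    Hypothetical helper for illustration only.
    """
    if len(audio) > N_SAMPLES:
        raise ValueError("audio is longer than 30 s; split it first")
    return np.pad(audio, (0, N_SAMPLES - len(audio)))


ten_seconds = np.ones(10 * SAMPLE_RATE, dtype=np.float32)
padded = pad_to_30s(ten_seconds)
print(padded.shape)  # (480000,)
```

With an unmodified model the encoder still processes the full 30 s window, so the padded part costs compute even though it carries no speech.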


sh1man999 Jan 27, 2025
Author

Padding is invisible to users if the input audio is shorter than 30s. It is an implementation detail.

Has fine-tuning the model improved your results?

I mean the line x = (x + self.positional_embedding[:x.shape[1], :]).to(x.dtype)
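
The effect of that sliced addition can be illustrated with a small NumPy stand-in. This is a sketch under stated assumptions, not the icefall code: it uses NumPy arrays instead of torch tensors, a toy embedding width, and a table of 1500 positions (the number of encoder frames official Whisper uses for 30 s of audio). The function name `add_positions` is hypothetical.

```python
import numpy as np

N_CTX = 1500   # encoder positions covering 30 s in official Whisper
N_STATE = 8    # toy width for the example; real models are much wider

rng = np.random.default_rng(0)
positional_embedding = rng.standard_normal((N_CTX, N_STATE)).astype(np.float32)


def add_positions(x):
    """x has shape (batch, frames, N_STATE).

    Only the first `frames` rows of the table are used, so inputs with
    fewer than N_CTX frames need no padding.
    """
    return (x + positional_embedding[: x.shape[1], :]).astype(x.dtype)


short = np.zeros((1, 500, N_STATE), dtype=np.float32)  # about 10 s of audio
print(add_positions(short).shape)  # (1, 500, 8)

too_long = np.zeros((1, 2000, N_STATE), dtype=np.float32)  # about 40 s
try:
    add_positions(too_long)
except ValueError as err:
    # The table has only N_CTX rows, so the shapes no longer match.
    print("longer than 30 s fails:", err)
```

This matches the maintainer's point: the slice lets shorter inputs through unpadded, but the table still caps the input at 30 s.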
