Can you tell me what is the fastest solution for whisper #703

Unanswered
sh1man999 asked this question in Q&A

Can you tell me what the fastest solution for Whisper is? Are there any models that do not have the 30-second audio limitation?

https://github.com/k2-fsa/icefall/blob/master/egs/aishell/ASR/whisper/whisper_encoder_forward_monkey_patch.py#L15


Replies: 1 comment, 12 replies


Do you want to recognize audio files longer or shorter than 30s?

@sh1man999

You cannot find a non-streaming model anywhere accepting audio files of any length.

Why do you say that is not possible?

Using Fine-tuned Whisper

Official Whisper models only accept 30-second audio inputs. To improve throughput, you could fine-tune the Whisper model to remove the 30-second restriction. See examples.

We prepared two Chinese fine-tuned Whisper TensorRT-LLM weight repos. They can be used directly from here.

@csukuangfj

Theoretically, if your RAM is infinite, then it is possible.

Removing the 30s constraint means you can input audio files shorter than 30s without padding. It does not mean you can input files of arbitrary length.

By the way, is there any disadvantage to using a VAD model here?
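
To make the length argument concrete, here is a rough sketch of the frame arithmetic. The constants are assumptions taken from the public openai/whisper code (16 kHz audio, 10 ms hop, a stride-2 convolution, a 1500-row positional-embedding table); the function names are made up for illustration and are not part of any library.

```python
# Assumed Whisper front-end constants (from the public openai/whisper code).
SAMPLE_RATE = 16_000   # samples per second
HOP_LENGTH = 160       # samples per mel frame (10 ms)
CONV_STRIDE = 2        # the encoder's second convolution halves the frame count
N_AUDIO_CTX = 1500     # rows in the encoder positional-embedding table


def encoder_positions(seconds: float) -> int:
    """Number of encoder time steps for unpadded audio of this duration."""
    n_frames = int(seconds * SAMPLE_RATE) // HOP_LENGTH
    # A kernel-3, padding-1, stride-2 convolution yields ceil(n_frames / 2) steps.
    return (n_frames + CONV_STRIDE - 1) // CONV_STRIDE


def fits_positional_table(seconds: float) -> bool:
    """True if every encoder step has a row in the positional-embedding table."""
    return encoder_positions(seconds) <= N_AUDIO_CTX


def attention_scores_per_head(seconds: float) -> int:
    """Size of one self-attention score matrix, which grows quadratically."""
    n = encoder_positions(seconds)
    return n * n


for duration in (10, 30, 60):
    print(duration, encoder_positions(duration), fits_positional_table(duration))
```

Under these assumptions, audio shorter than 30s needs fewer than 1500 positions and fits, while anything longer runs past the table. Even with a longer table, the attention matrices grow with the square of the length, which is where the "infinite RAM" remark comes from.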

@sh1man999

Theoretically, if your RAM is infinite, then it is possible.

Removing the 30s constraint means you can input audio files shorter than 30s without padding. It does not mean you can input files of arbitrary length.

By the way, is there any disadvantage to using a VAD model here?

Okay, I realize now that long files aren't possible.
I haven't noticed any flaws in the VAD model, but I haven't used VAD enough to assess it properly.

How do I input audio files shorter than 30 seconds without padding?

@csukuangfj

Padding is invisible to users if the input audio is shorter than 30s. It is an implementation detail.
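
As an illustration of what that hidden step looks like, here is a plain-Python stand-in for the kind of helper the official openai/whisper package exposes as `pad_or_trim`. This is only a sketch on Python lists, assuming 16 kHz audio and a 30-second window; the real helper works on arrays and tensors.

```python
SAMPLE_RATE = 16_000                      # assumed sample rate
CHUNK_SECONDS = 30                        # fixed window of the official models
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480000 samples per window


def pad_or_trim(samples: list[float], length: int = N_SAMPLES) -> list[float]:
    """Zero-pad short audio, or cut long audio, to exactly `length` samples."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


one_second = [0.1] * SAMPLE_RATE
window = pad_or_trim(one_second)
print(len(window))  # the caller passes 1s of audio, the model sees a full window
```

The caller never passes the zeros explicitly; the padding happens inside the pipeline before features are computed, which is why it goes unnoticed.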

@sh1man999

Padding is invisible to users if the input audio is shorter than 30s. It is an implementation detail.

Has fine-tuning the model improved your results?

The line in question: x = (x + self.positional_embedding[:x.shape[1], :]).to(x.dtype)
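
For readers unfamiliar with that line, the sketch below mimics the slicing with plain Python lists instead of tensors. It is an illustration only: `add_positional_embedding` is a hypothetical name, and the explicit length check is an addition for clarity rather than something the quoted line does.

```python
def add_positional_embedding(x, positional_embedding):
    """Add the first len(x) rows of the table to x.

    x: [time][dim] encoder features after the convolutions.
    positional_embedding: [max_time][dim] table (1500 rows in Whisper, assumed).
    """
    n_steps = len(x)
    if n_steps > len(positional_embedding):
        # Added check: there are no rows left for the extra time steps.
        raise ValueError("input is longer than the positional-embedding table")
    table = positional_embedding[:n_steps]  # counterpart of [:x.shape[1], :]
    return [[a + b for a, b in zip(row, pos)] for row, pos in zip(x, table)]


# Toy table with 4 positions and 2 dimensions; the input has only 2 time steps.
toy_table = [[float(t), float(t)] for t in range(4)]
features = [[1.0, 1.0], [1.0, 1.0]]
print(add_positional_embedding(features, toy_table))
```

Because only the first rows of the table are used, a short input needs no padding. An input with more time steps than the table has rows cannot be handled this way, which matches the point that removing the 30s constraint does not allow arbitrary length.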
