What are the main differences in large-v1, v2 and v3 models? They all seem to be nearly the same exact size so I am curious how I can see what the differences are?
They give you different results for the same input.
I haven't run any rigorous benchmarks in terms of WER (Word Error Rate; lower is better) comparing large-v1 vs large-v2 and so on, so here's my personal experience:
large-v1 is the first large model, and in most cases it gives worse results than large-v2.
The comparison between large-v2 and large-v3 is a bit more controversial.
In my experience, large-v3 often produces really bad hallucinations if the audio contains even a little noise.
( See #152 (comment) and openai/whisper#2378 for more info )
But if the audio is really clean, with little noise (like an ASR benchmark dataset), large-v3 gives more accurate timestamps in my experience.
And if you can tolerate slightly less accurate results, consider large-v3-turbo: it's lighter and faster, with only a minor quality downgrade compared to large-v3.
You can see how to use it in the Web UI at #309 (comment).
TL;DR:
If your audio is clean (no noise), use large-v3.
If not, use large-v2.
If you're OK with a slight loss of quality compared to large-v3, use large-v3-turbo for faster transcription.
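The rules of thumb above can be sketched as a small helper. This is just an illustration of the decision logic, not part of any Whisper library; the function name and parameters are hypothetical:

```python
def choose_whisper_model(audio_is_clean: bool, prioritize_speed: bool = False) -> str:
    """Pick a Whisper model per the rules of thumb above (hypothetical helper).

    - Speed over peak quality: large-v3-turbo is lighter and faster.
    - Clean audio: large-v3 gives more accurate timestamps.
    - Noisy audio: large-v2 is less prone to hallucinations.
    """
    if prioritize_speed:
        return "large-v3-turbo"
    return "large-v3" if audio_is_clean else "large-v2"


# Example: noisy phone recording, quality matters most
print(choose_whisper_model(audio_is_clean=False))  # large-v2
```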