VAD Parameters

jhj0517 edited this page Jun 26, 2024 · 1 revision

Silero VAD is currently implemented only for faster-whisper, so it is available only when you use the faster-whisper implementation.
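As a rough illustration of how these settings reach faster-whisper, the sketch below builds the keyword arguments for `WhisperModel.transcribe`, which accepts `vad_filter` and `vad_parameters`. The helper name `build_vad_kwargs` is hypothetical, and the accepted parameter names are simply the ones documented on this page; the installed faster-whisper version may accept a different set.

```python
# Parameter names as documented on this page (assumption: your installed
# faster-whisper version accepts the same names).
VAD_PARAMETER_NAMES = {
    "threshold",
    "min_speech_duration_ms",
    "max_speech_duration_s",
    "min_silence_duration_ms",
    "window_size_samples",
    "speech_pad_ms",
}


def build_vad_kwargs(vad_filter=True, **vad_parameters):
    """Hypothetical helper: build keyword arguments for transcribe()."""
    unknown = set(vad_parameters) - VAD_PARAMETER_NAMES
    if unknown:
        raise ValueError(f"Unknown VAD parameter(s): {sorted(unknown)}")
    kwargs = {"vad_filter": vad_filter}
    # VAD parameters only matter when the filter is enabled.
    if vad_filter and vad_parameters:
        kwargs["vad_parameters"] = dict(vad_parameters)
    return kwargs


kwargs = build_vad_kwargs(threshold=0.5, min_silence_duration_ms=2000)
print(kwargs)

# Assumed usage with faster-whisper (not executed here; model size and
# file name are placeholders):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("small")
#   segments, info = model.transcribe("audio.wav", **kwargs)
```

Leaving `vad_filter` at false means no VAD parameters are passed at all, which matches the default behavior of the filter being disabled.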

VAD Parameters

| Parameter | Description |
| --- | --- |
| vad_filter | Enables the VAD filter. It is disabled by default, so set it to true if you want to use it. |
| threshold | Silero VAD outputs a speech probability for each audio chunk; probabilities above this value are treated as speech. It is best to tune this parameter for each dataset separately, but a "lazy" 0.5 works well for most datasets. A lower value makes detection more sensitive, so quiet sounds are more likely to be treated as speech rather than silence. |
| min_speech_duration_ms | Final speech chunks shorter than min_speech_duration_ms are discarded. |
| max_speech_duration_s | Maximum duration of a speech chunk, in seconds. Chunks longer than max_speech_duration_s are split at the timestamp of the last silence that lasts more than 100 ms, if there is one, to avoid cutting through speech. Otherwise they are split just before max_speech_duration_s. |
| min_silence_duration_ms | At the end of each speech chunk, wait for min_silence_duration_ms of silence before separating it. |
| window_size_samples | Audio chunks of window_size_samples samples are fed to the Silero VAD model. Warning: Silero VAD models were trained with 512, 1024, and 1536 samples at a 16000 Hz sample rate. Other values may degrade model performance. |
| speech_pad_ms | Final speech chunks are padded by speech_pad_ms on each side. |
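To make the interaction between threshold, min_speech_duration_ms, min_silence_duration_ms, and speech_pad_ms more concrete, here is a simplified sketch that turns per-window speech probabilities into padded speech chunks. It is not the Silero VAD or faster-whisper implementation: the function `speech_segments` and the probability values are made up for illustration, max_speech_duration_s is left out, and overlapping padding between neighboring chunks is not handled.

```python
def speech_segments(probs, threshold=0.5, window_ms=32,
                    min_speech_duration_ms=250,
                    min_silence_duration_ms=100,
                    speech_pad_ms=30):
    """Simplified sketch: return (start_ms, end_ms) speech chunks."""
    total_ms = len(probs) * window_ms
    segments = []
    start = None          # start of the current speech chunk, in ms
    silence_start = None  # start of the current run of silence, in ms

    for i, p in enumerate(probs):
        t = i * window_ms
        if p > threshold:
            # Speech window: cancel any pending silence, open a chunk if needed.
            silence_start = None
            if start is None:
                start = t
        elif start is not None:
            if silence_start is None:
                silence_start = t
            # Separate the chunk only after enough continuous silence.
            if t + window_ms - silence_start >= min_silence_duration_ms:
                segments.append((start, silence_start))
                start = None
                silence_start = None

    # A chunk still open at the end of the audio runs to the end.
    if start is not None:
        segments.append((start, total_ms))

    # Throw out chunks shorter than min_speech_duration_ms, then pad each side.
    kept = [(s, e) for s, e in segments if e - s >= min_speech_duration_ms]
    return [(max(0, s - speech_pad_ms), min(total_ms, e + speech_pad_ms))
            for s, e in kept]


# 512 samples at a 16000 Hz sample rate is a 32 ms window.
window_ms = 1000 * 512 // 16000

# Made-up probabilities, one per window.
probs = [0.1, 0.9, 0.9, 0.9, 0.2, 0.9, 0.9, 0.1,
         0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.1]

segments = speech_segments(probs, threshold=0.5, window_ms=window_ms,
                           min_speech_duration_ms=64,
                           min_silence_duration_ms=96,
                           speech_pad_ms=16)
print(segments)  # [(16, 240)]
```

In this example the single low-probability window at 128 ms is shorter than min_silence_duration_ms, so it does not split the first chunk, while the isolated 32 ms burst at 352 ms is shorter than min_speech_duration_ms and is thrown out.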