VERSA (Versatile Evaluation of Speech and Audio) is a toolkit dedicated to collecting evaluation metrics in speech and audio quality. Our goal is to provide a comprehensive connection to the cutting-edge techniques developed for evaluation. The toolkit is also tightly integrated into ESPnet.
Colab Demonstration at Interspeech2024 Tutorial
The base installation is as easy as follows:
git clone https://github.com/shinjiwlab/versa.git
cd versa
pip install .
or
pip install git+https://github.com/shinjiwlab/versa.git
For collection purposes, instead of re-distributing the models, VERSA tries to align as closely as possible to the original APIs provided by the algorithm developers. Therefore, we have many dependencies. We include as many as possible by default, but there are cases where a metric needs specific installation requirements. Please refer to our list-of-metrics section for details on whether a metric is automatically included. If not, we provide an installation guide or installers in tools.
# test the installation (optional)
python versa/test/test_script.py
Simple usage examples for a few samples:
# direct usage
python versa/bin/scorer.py \
--score_config egs/speech.yaml \
--gt test/test_samples/test1 \
--pred test/test_samples/test2 \
--output_file test_result
# with scp-style input
python versa/bin/scorer.py \
--score_config egs/speech.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result
# with kaldi-ark style
python versa/bin/scorer.py \
--score_config egs/speech.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result \
--io kaldi
# For text information
python versa/bin/scorer.py \
--score_config egs/separate_metrics/wer.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result \
--text test/test_samples/text
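For reference, the scp-style inputs above typically follow the Kaldi convention: one utterance per line, an utterance ID followed by the audio path. A minimal sketch (the IDs and paths below are placeholders, not shipped test data):
# example contents of an scp file such as test2.scp
utt1 /path/to/generated/utt1.wav
utt2 /path/to/generated/utt2.wav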
Use the launcher with Slurm job submission:
# use the launcher
# Option1: with gt speech
./launch.sh \
<pred_speech_scp> \
<gt_speech_scp> \
<score_dir> \
<split_job_num>
# Option2: without gt speech
./launch.sh \
<pred_speech_scp> \
None \
<score_dir> \
<split_job_num>
# aggregate the results
cat <score_dir>/result/*.result.cpu.txt > <score_dir>/utt_result.cpu.txt
cat <score_dir>/result/*.result.gpu.txt > <score_dir>/utt_result.gpu.txt
# show result
python scripts/show_result.py <score_dir>/utt_result.cpu.txt
python scripts/show_result.py <score_dir>/utt_result.gpu.txt
See egs/*.yaml for different configurations for different setups.
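Roughly speaking, a score configuration is a YAML list of metric entries whose name matches a "Key in config" value from the table below. The snippet here is a minimal sketch following the style of egs/speech.yaml, with metric-specific options omitted:
# a minimal, illustrative config (not a shipped file)
- name: signal_metric
- name: pesq
- name: stoi
- name: pseudo_mos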
We use [x] and [ ] to mark whether the metric is auto-installed in VERSA.
Number | Metric Name (Auto-Install) | Key in config | Key in report | Code Source | References |
---|---|---|---|---|---|
1 | Mel Cepstral Distortion (MCD) [x] | mcd_f0 | mcd | espnet and s3prl-vc | paper |
2 | F0 Correlation [x] | mcd_f0 | f0_corr | espnet and s3prl-vc | paper |
3 | F0 Root Mean Square Error [x] | mcd_f0 | f0_rmse | espnet and s3prl-vc | paper |
4 | Signal-to-interference Ratio (SIR) [x] | signal_metric | sir | espnet | - |
5 | Signal-to-artifact Ratio (SAR) [x] | signal_metric | sar | espnet | - |
6 | Signal-to-distortion Ratio (SDR) [x] | signal_metric | sdr | espnet | - |
7 | Convolutional scale-invariant signal-to-distortion ratio (CI-SDR) [x] | signal_metric | ci-sdr | ci_sdr | paper |
8 | Scale-invariant signal-to-noise ratio (SI-SNR) [x] | signal_metric | si-snr | espnet | paper |
9 | Perceptual Evaluation of Speech Quality (PESQ) [x] | pesq | pesq | pesq | paper |
10 | Short-Time Objective Intelligibility (STOI) [x] | stoi | stoi | pystoi | paper |
11 | Speech BERT Score [x] | discrete_speech | speech_bert | discrete speech metric | paper |
12 | Discrete Speech BLEU Score [x] | discrete_speech | speech_bleu | discrete speech metric | paper |
13 | Discrete Speech Token Edit Distance [x] | discrete_speech | speech_token_distance | discrete speech metric | paper |
14 | UTokyo-SaruLab System for VoiceMOS Challenge 2022 (UTMOS) [x] | pseudo_mos | utmos | speechmos | paper |
15 | Deep Noise Suppression MOS Score of P.835 (DNSMOS) [x] | pseudo_mos | dnsmos_overall | speechmos (MS) | paper |
16 | Deep Noise Suppression MOS Score of P.808 (DNSMOS) [x] | pseudo_mos | dnsmos_p808 | speechmos (MS) | paper |
17 | Packet Loss Concealment-related MOS Score (PLCMOS) [x] | pseudo_mos | plcmos | speechmos (MS) | paper |
18 | Virtual Speech Quality Objective Listener (VISQOL) [ ] | visqol | visqol | google-visqol | paper |
19 | Speaker Embedding Similarity [x] | speaker | spk_similarity | espnet | paper |
20 | PESQ in TorchAudio-Squim [x] | squim_no_ref | torch_squim_pesq | torch_squim | paper |
21 | STOI in TorchAudio-Squim [x] | squim_no_ref | torch_squim_stoi | torch_squim | paper |
22 | SI-SDR in TorchAudio-Squim [x] | squim_no_ref | torch_squim_si_sdr | torch_squim | paper |
23 | MOS in TorchAudio-Squim [x] | squim_ref | torch_squim_mos | torch_squim | paper |
24 | Singing voice MOS [x] | singmos | singmos | singmos | paper |
25 | Log-Weighted Mean Square Error [x] | log_wmse | log_wmse | log_wmse | |
26 | Dynamic Time Warping Cost Metric [ ] | warpq | warpq | WARP-Q | paper |
27 | Sheet SSQA MOS Models [x] | sheet_ssqa | sheet_ssqa | Sheet | paper |
28 | ESPnet Speech Recognition-based Error Rate [x] | espnet_wer | espnet_wer | ESPnet | paper |
29 | ESPnet-OWSM Speech Recognition-based Error Rate [x] | owsm_wer | owsm_wer | ESPnet | paper |
30 | OpenAI-Whisper Speech Recognition-based Error Rate [x] | whisper_wer | whisper_wer | Whisper | paper |
31 | UTMOSv2: UTokyo-SaruLab MOS Prediction System [ ] | utmosv2 | utmosv2 | UTMOSv2 | paper |
32 | Speech Contrastive Regression for Quality Assessment with reference (ScoreQ) [ ] | scoreq_ref | scoreq_ref | ScoreQ | paper |
33 | Speech Contrastive Regression for Quality Assessment without reference (ScoreQ) [ ] | scoreq_nr | scoreq_nr | ScoreQ | paper |
A few more metrics are in verification/progress.
Implementing a new metric in VERSA involves the following steps:
You may add the metric implementation in one of the following sub-directories (versa/corpus_metrics, versa/utterance_metrics, or versa/sequence_metrics). Specifically,
- corpus_metrics: for metrics that need the whole corpus to compute (e.g., FAD or WER).
- utterance_metrics: for utterance-level metrics.
- sequence_metrics (will be deprecated in later versions and merged into utterance_metrics): for metrics that compare two feature sequences.
The typical format of a metric module includes two functions: one for model setup and one for inference. Please refer to versa/utterance_metrics/speaker.py for an example implementation. For special cases where the model setup is simple or not needed, you can provide only the inference function without the setup function, as exemplified in versa/utterance_metrics/stoi.py.
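Below is a minimal sketch of such a module, assuming the utterance-level interface described in the notes that follow; the metric name, the placeholder computation, and the exact argument order are illustrative only (see versa/utterance_metrics/speaker.py for the authoritative pattern).
# versa/utterance_metrics/my_new_metric.py (hypothetical example)
import numpy as np


def my_new_metric_setup(use_gpu=False):
    """Set up whatever the metric needs; keep use_gpu defaulting to False."""
    device = "cuda" if use_gpu else "cpu"
    model = {"device": device}  # e.g., load a pretrained predictor onto `device` here
    return model


def my_new_metric(model, pred_x, fs, gt_x=None):
    """Return the metric value(s) as a dict keyed by the report key."""
    # pred_x: audio to evaluate; gt_x: optional reference at the same sampling rate fs
    score = float(np.mean(np.abs(pred_x)))  # placeholder computation, not a real metric
    return {"my_new_metric": score}


if __name__ == "__main__":
    # simple local test, as suggested in the notes below
    test_pred = np.random.rand(16000)
    test_gt = np.random.rand(16000)
    model = my_new_metric_setup()
    print(my_new_metric(model, test_pred, 16000, test_gt))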
Special note:
- Please consider adding a simple test function at the end of the implementation.
- For consistency, we follow some fixed naming conventions:
  - For the setup function, include an argument use_gpu, which defaults to False.
  - For the inference function, the preceding preprocessor can provide five arguments so far (if you need more, please contact Jiatong Shi for further discussion on the interface):
    - model: the inference model to use
    - pred_x: the audio signal to be evaluated
    - fs: the audio signal's sampling rate
    - gt_x: [optional] the reference audio signal (handled automatically in the preceding parts; the reference signal should have the same sampling rate as the signal to be evaluated)
    - ref_text: [optional] additional text information, either the transcription for WER or a text description of the audio signal
- Toolkit development: to link VERSA to other implementations, we recommend using the original tool/interface as much as possible if it can already be fit into the current interface. If it cannot, we recommend the following options to link the method to VERSA (this also works for packages that need very specific versions of their dependencies); an illustrative installer sketch follows this list:
  - fork their repo;
  - add a customized interface;
  - add localized install options to tools.
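For example, a localized installer could be a small shell script under tools/ that pulls the fork and installs it. The script name, repository URL, and layout below are assumptions for illustration, not an existing installer:
#!/bin/bash
# tools/install_my_metric.sh (hypothetical installer for a forked dependency)
set -e
git clone https://github.com/your-org/my_metric_fork.git
cd my_metric_fork
pip install -e .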
For the second step, please add your metric to the scoring list in versa/scorer_shared.py. Notably, you are expected to add the new metric in both load_score_modules() and use_score_modules().
At this step, please define a unique key for your metric to differentiate it from others. By referring to this key, you can declare the setup function in load_score_modules() and the inference function in use_score_modules(). Please refer to the existing examples so that your additions follow the same setup.
At this point, the major implementation is done, and we mainly focus on the docs, test functions, examples, and code wrap-up.
For Docs, please add your metric to the README.md (List of Metrics section). If the metric needs external tools from installers in tools, please mark it with [ ] in the Metric Name (Auto-Install) field (column).
For Tests, please add local test functions to the corresponding metric scripts for now (we will enable CI tests at a later stage).
For Examples, please put a separate yaml-style configuration file in egs/separate_metrics, following the other examples.
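Such a file can be as small as a single entry; the name below is a placeholder, and metric-specific options should be added as needed:
# egs/separate_metrics/my_new_metric.yaml (hypothetical)
- name: my_new_metric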
For Code-wrapping up, we highly recommend using black and isort to format your added scripts.