This README shows how to calculate the xSIM (multilingual similarity) error rate for a given language pair.
xSIM returns the error rate for encoding bitexts into the same embedding space. That is, given a bitext with source-language embeddings X and target-language embeddings Y, xSIM aligns the embeddings from X and Y using a margin-based similarity, and then returns the percentage of incorrect alignments.
xSIM offers three margin-based scoring options (discussed in detail here):
- distance
- ratio
- absolute
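To make the three options concrete, here is a minimal brute-force sketch of margin-based scoring in NumPy. It is not the code in xsim.py: the helper names `margin_scores` and `count_index_errors` are hypothetical, the neighbourhood size `k=4` is an assumption, and the formulas follow the common margin definitions (absolute: cosine only; distance: cosine minus the average neighbourhood similarity; ratio: cosine divided by it).

```python
import numpy as np


def margin_scores(x, y, margin="ratio", k=4):
    """Score every (source, target) pair; a higher score is a better match."""
    # L2-normalise the rows so that a dot product equals cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    sim = x @ y.T  # sim[i, j] = cos(x_i, y_j)
    if margin == "absolute":
        return sim
    k = min(k, sim.shape[0], sim.shape[1])
    # Mean similarity of each sentence to its k nearest neighbours
    # on the other side of the bitext.
    x_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # one value per source
    y_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # one value per target
    neighbourhood = (x_knn[:, None] + y_knn[None, :]) / 2
    if margin == "distance":
        return sim - neighbourhood
    if margin == "ratio":
        return sim / neighbourhood
    raise ValueError(f"unknown margin: {margin}")


def count_index_errors(x, y, margin="ratio", k=4):
    """Align each source to its best-scoring target and count mismatches."""
    scores = margin_scores(x, y, margin=margin, k=k)
    best = scores.argmax(axis=1)
    err = int((best != np.arange(len(x))).sum())
    return err, len(x)
```

A brute-force similarity matrix like this is only practical for small evaluation sets; it is meant to show what the margin options compute, not how to run them at scale.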
Simply run the example script with bash ./eval.sh to download a sample dataset (flores200) and a sample encoder (laser2), and to calculate the sentence embeddings and the xSIM error rate for a set of (comma-separated) languages.
You can also calculate xSIM for encoders hosted on HuggingFace sentence-transformers. For example, to use LaBSE, modify or add the following arguments in the sample script:
--src-encoder LaBSE
--use-hugging-face
--embedding-dimension 768
Note: for HuggingFace encoders there is no need to specify --src-spm-model.
Import xsim
from xsim import xSIM
Calculate xsim from either numpy float arrays (e.g. np.float32) or binary embedding files
# A: numpy arrays x and y
err, nbex = xSIM(x, y)
# B: binary embedding files (x and y are paths to the files)
fp16_flag = False # set to True if embeddings are saved as 16-bit floats
embedding_dim = 1024 # set to the dimension of the saved embeddings
err, nbex = xSIM(
x,
y,
dim=embedding_dim,
fp16=fp16_flag
)
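If you need to produce such files yourself, the sketch below round-trips embeddings through a raw binary file. It assumes the files are headerless row-major dumps of float32 (or float16) values, which is why the dimension has to be supplied when reading; `save_raw_embeddings` and `load_raw_embeddings` are hypothetical helpers, not part of xsim.

```python
import numpy as np


def save_raw_embeddings(emb, path, fp16=False):
    """Write an (n, dim) array as a headerless row-major binary file."""
    dtype = np.float16 if fp16 else np.float32
    np.ascontiguousarray(emb, dtype=dtype).tofile(path)


def load_raw_embeddings(path, dim, fp16=False):
    """Read a headerless binary file back into an (n, dim) float32 array."""
    dtype = np.float16 if fp16 else np.float32
    flat = np.fromfile(path, dtype=dtype)
    if flat.size % dim != 0:
        # A wrong dim or a wrong fp16 flag usually shows up here.
        raise ValueError(f"{flat.size} values cannot be split into rows of {dim}")
    return flat.reshape(-1, dim).astype(np.float32)
```

Because the file carries no metadata, a mismatched dimension or precision flag silently reshapes the data unless the size check happens to fail, so it is worth checking the number of rows against the number of sentences.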
Error type
# A: textual-based error (allows for duplicates)
tgt_text = "/path/to/target-text-file"
err, nbex = xSIM(x, y, eval_text=tgt_text)
# B: index-based error (default)
err, nbex = xSIM(x, y)
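The difference between the two error types matters when the target side contains repeated sentences. The sketch below illustrates the idea under the assumption that a textual-based error is counted only when the aligned target sentence differs as a string from the expected one; `count_errors` is a hypothetical helper, not the xsim implementation.

```python
def count_errors(nearest, tgt_sentences=None):
    """Count alignment errors.

    nearest[i] is the index of the target sentence aligned to source i.
    Without tgt_sentences the check is index-based; with it, the check
    compares the sentence text, so duplicates are not penalised.
    """
    err = 0
    for i, j in enumerate(nearest):
        if tgt_sentences is None:
            wrong = j != i
        else:
            wrong = tgt_sentences[j] != tgt_sentences[i]
        err += int(wrong)
    return err, len(nearest)


# Sentences 0 and 2 are identical, so aligning 0 to 2 is harmless.
tgt = ["Hello.", "Thanks.", "Hello."]
nearest = [2, 1, 0]
index_err, nbex = count_errors(nearest)
text_err, _ = count_errors(nearest, tgt)
```

Here the index-based count flags two errors while the textual-based count flags none, since both mismatched indices point at the same sentence text.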
Margin selection
# A: ratio (default)
err, nbex = xSIM(x, y)
# B: distance
err, nbex = xSIM(x, y, margin='distance')
# C: absolute
err, nbex = xSIM(x, y, margin='absolute')
Finally, to calculate the error rate as a percentage, compute 100 * err / nbex (number of errors over total examples).
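As a small convenience, the formula can be wrapped in a function that also guards against an empty evaluation set. `error_rate` is a hypothetical helper name, not something exported by xsim.

```python
def error_rate(err, nbex):
    """Return the xSIM error rate as a percentage of all examples."""
    if nbex == 0:
        raise ValueError("cannot compute an error rate over zero examples")
    return 100 * err / nbex


# 5 incorrect alignments out of 200 sentence pairs
rate = error_rate(5, 200)
```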