Individual vs inclusive 639-3 codes #287
ZJaume started this conversation in Show and tell
Replies: 1 comment
-
Hi Jaume, thank you for this information. You are right, it's difficult to find proper training data for each of the variants you have listed. That's why I decided back then not to differentiate between all those variants. I plan to add more languages to the library next year. Maybe I will rework some of the current languages and ISO codes, depending on whether I can find suitable training data.
-
Hi,
I'm comparing Lingua with the new FastText model for NLLB using your benchmark (by the way, if you are interested, I can submit a PR with the changes needed to run the benchmark with the new FastText model). This model uses ISO 639-3, but I found some differences between the sets of codes in the FastText NLLB model and in Lingua. These arise because FastText always (or almost always) uses individual language codes instead of inclusive codes, whereas Lingua uses inclusive codes in most cases.
The ideal, I think, would be to identify all possible languages and therefore always use individual codes, but I know this is hard, especially for pluricentric languages (like Malay or Serbo-Croatian), and even more so if the variants are mutually intelligible. Or maybe there is no data to train a model for each variant.
So, I just wanted to point out these differences in case they are useful for you. These are the conversions I'm doing:
I do not speak any of the languages that differ and do not know the source of the test data, so I cannot tell whether this is 100% true. But there are test sets for which the FastText model supports both variants, yet it predicts only one of them, so maybe Lingua is using inclusive codes but in practice only covers one of the variants of each inclusive code?
For context, this is the list of inclusive and individual codes and names from Wikipedia:
Latvian lav – inclusive code
Farsi fas – inclusive code
Azerbaijani aze – inclusive code
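To illustrate the kind of conversion involved, here is a minimal sketch in Python that collapses individual ISO 639-3 codes into the three inclusive codes listed above. It is not the exact conversion table used for the benchmark: the mapping only lists the individual codes that ISO 639-3 places under lav, fas and aze, the function name to_inclusive is hypothetical, and the label format (a "__label__" prefix followed by code and script, such as "__label__lvs_Latn") is an assumption about the FastText NLLB model's output.

```python
# Individual ISO 639-3 codes grouped under the inclusive (macrolanguage)
# codes discussed above. Illustrative only; not the benchmark's actual table.
INDIVIDUAL_TO_INCLUSIVE = {
    "lvs": "lav",  # Standard Latvian
    "ltg": "lav",  # Latgalian
    "pes": "fas",  # Iranian Persian
    "prs": "fas",  # Dari
    "azj": "aze",  # North Azerbaijani
    "azb": "aze",  # South Azerbaijani
}

LABEL_PREFIX = "__label__"  # assumed FastText label prefix


def to_inclusive(label: str) -> str:
    """Map a predicted label to an inclusive code where one is known.

    Accepts either a bare code ("lvs") or an assumed FastText-style
    label ("__label__lvs_Latn"). Unknown codes are returned unchanged.
    """
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    code = label.split("_", 1)[0]  # drop the script suffix, if any
    return INDIVIDUAL_TO_INCLUSIVE.get(code, code)
```

With a mapping like this, predictions for both individual variants are counted as correct for the inclusive code, which also hides whether a tool actually covers both variants.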
There is also the case of Malay, where Lingua uses the inclusive code msa, but this code includes Indonesian (ind). Maybe the Lingua code should be Standard Malay (zsm)? But this is a difficult case and may need much more work, since Wikipedia says they are close to mutually intelligible and we already know from the benchmark that tools struggle to differentiate between them.
Sorry about this "brick" of text and thank you for your tool, it is really helpful!