Skip to content

Latest commit

 

History

History
184 lines (168 loc) · 5.69 KB

README.md

File metadata and controls

184 lines (168 loc) · 5.69 KB

No Language Left Behind

LASER3 - No Language Left Behind

As part of the project No Language Left Behind (NLLB) we have developed new LASER encoders, referred to here as LASER3. Each LASER3 encoder has a particular focus language which it supports, and the full list of available LASER3 encoders can be found at the bottom of this README.

We have also included an updated version of the original LASER encoder: LASER2. This improved model supports the same languages which LASER was trained on. In order to find more details on how both the LASER2 and LASER3 encoders were trained, please see Heffernan et. al, 2022.

We also provide code to train LASER3 teacher-student models and stopes, a new powerful and flexible mining library.

Downloading encoders

To download the available encoders, please run the download_models.sh script within this directory.

bash ./download_models.sh

LASER2 and all LASER3 encoders are downloaded by default. However, downloading all LASER3 encoders may take up a lot of disk space. Therefore, you may choose to select individual LASER3 encoders to download by supplying a list of available language codes (see full list). For example: bash ./download_models.sh wol_Latn zul_Latn ...

By default, this download script will place all supported models within the calling directory.

Note: LASER3 encoders for each focus language are in the format: laser3-{language_code}.

Embedding texts

Once encoders are downloaded, you can then begin embedding texts by following the instructions here.

For example: ./LASER/tasks/embed/embed.sh [INFILE] [OUTFILE] wol_Latn

List of available LASER3 encoders

Code Language
ace_Latn Acehnese (Latin script)
aka_Latn Akan
als_Latn Tosk Albanian
amh_Ethi Amharic
asm_Beng Assamese
awa_Deva Awadhi
ayr_Latn Central Aymara
azb_Arab South Azerbaijani
azj_Latn North Azerbaijani
bak_Cyrl Bashkir
bam_Latn Bambara
ban_Latn Balinese
bel_Cyrl Belarusian
bem_Latn Bemba
ben_Beng Bengali
bho_Deva Bhojpuri
bjn_Latn Banjar (Latin script)
bod_Tibt Standard Tibetan
bug_Latn Buginese
ceb_Latn Cebuano
cjk_Latn Chokwe
ckb_Arab Central Kurdish
crh_Latn Crimean Tatar
cym_Latn Welsh
dik_Latn Southwestern Dinka
diq_Latn Southern Zaza
dyu_Latn Dyula
dzo_Tibt Dzongkha
ewe_Latn Ewe
fao_Latn Faroese
fij_Latn Fijian
fon_Latn Fon
fur_Latn Friulian
fuv_Latn Nigerian Fulfulde
gaz_Latn West Central Oromo
gla_Latn Scottish Gaelic
gle_Latn Irish
grn_Latn Guarani
guj_Gujr Gujarati
hat_Latn Haitian Creole
hau_Latn Hausa
hin_Deva Hindi
hne_Deva Chhattisgarhi
hye_Armn Armenian
ibo_Latn Igbo
ilo_Latn Ilocano
ind_Latn Indonesian
jav_Latn Javanese
kab_Latn Kabyle
kac_Latn Jingpho
kam_Latn Kamba
kan_Knda Kannada
kas_Arab Kashmiri (Arabic script)
kas_Deva Kashmiri (Devanagari script)
kat_Geor Georgian
kaz_Cyrl Kazakh
kbp_Latn Kabiyè
kea_Latn Kabuverdianu
khk_Cyrl Halh Mongolian
khm_Khmr Khmer
kik_Latn Kikuyu
kin_Latn Kinyarwanda
kir_Cyrl Kyrgyz
kmb_Latn Kimbundu
kmr_Latn Northern Kurdish
knc_Arab Central Kanuri (Arabic script)
knc_Latn Central Kanuri (Latin script)
kon_Latn Kikongo
lao_Laoo Lao
lij_Latn Ligurian
lim_Latn Limburgish
lin_Latn Lingala
lmo_Latn Lombard
ltg_Latn Latgalian
ltz_Latn Luxembourgish
lua_Latn Luba-Kasai
lug_Latn Ganda
luo_Latn Luo
lus_Latn Mizo
mag_Deva Magahi
mai_Deva Maithili
mal_Mlym Malayalam
mar_Deva Marathi
min_Latn Minangkabau (Latin script)
mlt_Latn Maltese
mni_Beng Meitei (Bengali script)
mos_Latn Mossi
mri_Latn Maori
mya_Mymr Burmese
npi_Deva Nepali
nso_Latn Northern Sotho
nus_Latn Nuer
nya_Latn Nyanja
ory_Orya Odia
pag_Latn Pangasinan
pan_Guru Eastern Panjabi
pap_Latn Papiamento
pbt_Arab Southern Pashto
pes_Arab Western Persian
plt_Latn Plateau Malagasy
prs_Arab Dari
quy_Latn Ayacucho Quechua
run_Latn Rundi
sag_Latn Sango
san_Deva Sanskrit
sat_Beng Santali
scn_Latn Sicilian
shn_Mymr Shan
sin_Sinh Sinhala
smo_Latn Samoan
sna_Latn Shona
snd_Arab Sindhi
som_Latn Somali
sot_Latn Southern Sotho
srd_Latn Sardinian
ssw_Latn Swati
sun_Latn Sundanese
swh_Latn Swahili
szl_Latn Silesian
tam_Taml Tamil
taq_Latn Tamasheq (Latin script)
tat_Cyrl Tatar
tel_Telu Telugu
tgk_Cyrl Tajik
tgl_Latn Tagalog
tha_Thai Thai
tir_Ethi Tigrinya
tpi_Latn Tok Pisin
tsn_Latn Tswana
tso_Latn Tsonga
tuk_Latn Turkmen
tum_Latn Tumbuka
tur_Latn Turkish
twi_Latn Twi
tzm_Tfng Central Atlas Tamazight
uig_Arab Uyghur
umb_Latn Umbundu
urd_Arab Urdu
uzn_Latn Northern Uzbek
vec_Latn Venetian
war_Latn Waray
wol_Latn Wolof
xho_Latn Xhosa
ydd_Hebr Eastern Yiddish
yor_Latn Yoruba
zsm_Latn Standard Malay
zul_Latn Zulu