NADI Shared Task 2020 for Arabic dialect classification site

Arabic has a widely varying collection of dialects. Many of these dialects remain under-studied due to the rarity of resources. The goal of the shared task is to alleviate this bottleneck in the context of fine-grained Arabic dialect identification. Dialect identification is the task of classifying the dialect of the tweet writer given the tweet itself.

We present our model for Arabic dialect classification, which ranked fourth on the WANLP 2020 leaderboard.

Run train.py to start the training process.

Several parameters can be changed in train_config.txt; a detailed explanation will be provided later.

Summary

Starting from pre-trained AraBERT, we first fine-tuned the model with masked language modeling on Arabic tweets, as shown in the image below. This step is also known as domain adaptation.
We then added a classification layer and retrained the fine-tuned model to distinguish between Arabic dialects.
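
To make the domain adaptation step more concrete, here is a minimal sketch of how masked language modeling corrupts a tokenized tweet before the model is asked to recover the original tokens. It assumes the standard BERT recipe (select about 15% of positions; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged). The README does not state which masking settings were actually used, and `mask_tokens`, `MASK_TOKEN`, and the toy vocabulary are hypothetical names for illustration only, not code from this repository.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol, as used by BERT-style tokenizers


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token at positions selected for prediction
    and None elsewhere, so a loss would only be computed on selected positions.
    """
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)  # position not selected, no loss here
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    tweet = ["w1", "w2", "w3", "w4", "w5", "w6"]  # stand-in for a tokenized tweet
    toy_vocab = ["w1", "w2", "w3", "w7", "w8"]
    print(mask_tokens(tweet, toy_vocab, mask_prob=0.3, rng=random.Random(0)))
```

In practice this corruption is usually handled by a library data collator rather than written by hand. After this stage, the adapted encoder weights are reused and a classification layer is trained on top of them with dialect labels.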

credit

Hugging Face repo

Farasa segmenter repo

Folders and files

utils
FarasaSegmenterJar.jar
README.md
data_loader.py
models.py
requirements.txt
run_model.py
train.py
train_config.txt

abdelrahman-wael/Arabic-Dialect-Classification-Nadi-Shared-Task
