A library for evaluating Arabic NLP datasets on ChatGPT models.
```bash
pip install -e .
```
```python
import taqyim as tq

pipeline = tq.Pipeline(
    eval_name="ajgt-test",
    dataset_name="arbml/ajgt_ubc_split",
    task_class="classification",
    task_description="Sentiment Analysis",
    input_column_name="content",
    target_column_name="label",
    prompt="Predict the sentiment",
    api_key="<openai-key>",
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=1,
)

# run the evaluation
pipeline.run()

# show the output data frame
pipeline.show_results()

# show the eval metrics
pipeline.get_final_report()
```
`custom_dataset.ipynb` contains a complete example of running an evaluation on a custom dataset.
- `eval_name`: name of the evaluation run
- `task_class`: class name from the supported class names
- `task_description`: short description of the task
- `dataset_name`: dataset name for evaluation
- `subset`: subset name, if the dataset has subsets
- `train_split`: train split name in the dataset
- `test_split`: test split name in the dataset
- `input_column_name`: input column name in the dataset
- `target_column_name`: target column name in the dataset
- `prompt`: the prompt fed to the model
- `api_key`: your OpenAI API key
- `preprocessing_fn`: function used to preprocess inputs and targets
- `threads`: number of threads used to call the API
- `threads_timeout`: thread timeout
- `max_samples`: maximum number of samples drawn from the dataset for evaluation
- `model_name`: either `gpt-3.5-turbo-0301` or `gpt-4-0314`
- `temperature`: temperature passed to the model, between 0 and 2; higher values yield more random results
- `num_few_shot`: number of few-shot samples used for evaluation
- `resume_from_record`: if `True`, the run continues from the first sample that has no results
- `seed`: seed to reproduce the results
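A common use of `preprocessing_fn` on AJGT-style data is mapping integer labels to the label strings the model is asked to predict. A minimal sketch, assuming the function receives each dataset record as a dict and returns the modified record (check the library source for the exact contract):

```python
# Hypothetical preprocessing function: map AJGT's integer labels to
# "Negative"/"Positive" strings so the model's text answer can be
# compared against the target directly.
def map_labels(sample):
    label_names = {0: "Negative", 1: "Positive"}
    sample["label"] = label_names[sample["label"]]
    return sample

# Example record shaped like a row of arbml/ajgt_ubc_split
record = {"content": "...", "label": 1}
print(map_labels(record)["label"])  # Positive
```

The function would then be passed as `preprocessing_fn=map_labels` when constructing the pipeline.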
- `Classification`: classification tasks, see `classification.py`
- `Pos_Tagging`: part-of-speech tagging tasks, see `pos_tagging.py`
- `Translation`: machine translation, see `translation.py`
- `Summarization`: summarization, see `summarization.py`
- `MCQ`: multiple-choice question answering, see `mcq.py`
- `Rating`: rating the outputs of multiple LLMs, see `rating.py`
- `Diacritization`: diacritization, see `diacritization.py`
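Every task class plugs into the same `Pipeline` interface, so switching tasks mostly means changing `task_class`, the prompt, and the column names. A hypothetical sketch for a translation run; the dataset name, split names, and column names below are placeholders, not values from this repository:

```python
import taqyim as tq

# Illustrative only: substitute a real Hugging Face dataset and its
# actual split/column names before running.
pipeline = tq.Pipeline(
    eval_name="translation-test",
    dataset_name="<hf-translation-dataset>",
    task_class="translation",
    task_description="Machine Translation",
    input_column_name="<source-column>",
    target_column_name="<target-column>",
    prompt="Translate the following sentence to English",
    api_key="<openai-key>",
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=1,
)
pipeline.run()
```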
Task | Dataset | Size | Metrics | GPT-3.5 | GPT-4 | SoTA |
---|---|---|---|---|---|---|
Summarization | EASC | 153 | RougeL | 23.5 | 18.25 | 13.3 |
PoS Tagging | PADT | 680 | Accuracy | 75.91 | 86.29 | 96.83 |
Classification | AJGT | 360 | Accuracy | 86.94 | 90.30 | 96.11 |
Transliteration | BOLT Egyptian✢ | 6,653 | BLEU | 13.76 | 27.66 | 65.88 |
Translation | UN v1 | 4,000 | BLEU | 35.05 | 38.83 | 53.29 |
Paraphrasing | APB | 1,010 | BLEU | 4.295 | 6.104 | 17.52 |
Diacritization | WikiNews✢✢ | 393 | WER/DER | 32.74/10.29 | 38.06/11.64 | 4.49/1.21 |
✢ BOLT requires an LDC subscription.

✢✢ WikiNews is not public; contact the authors for access to the dataset.
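The Diacritization row reports WER/DER (word and diacritic error rates). As a reference point for reading those numbers, WER is the word-level edit distance between hypothesis and reference, normalized by the reference length; a minimal sketch (not the library's implementation):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

# one substitution over three reference words
print(wer("the cat sat", "the cat sits"))
```

DER is computed analogously at the level of individual diacritic marks.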
```bibtex
@misc{alyafeai2023taqyim,
    title={Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models},
    author={Zaid Alyafeai and Maged S. Alshaibani and Badr AlKhamissi and Hamzah Luqman and Ebrahim Alareqi and Ali Fadel},
    year={2023},
    eprint={2306.16322},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```