A library for evaluating Arabic NLP datasets on ChatGPT models.
```bash
pip install -e .
```
```python
import taqyim as tq

pipeline = tq.Pipeline(
    eval_name="ajgt-test",
    dataset_name="arbml/ajgt_ubc_split",
    task_class="classification",
    task_description="Sentiment Analysis",
    input_column_name="content",
    target_column_name="label",
    prompt="Predict the sentiment",
    api_key="<openai-key>",
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=1,
)

# run the evaluation
pipeline.run()

# show the output data frame
pipeline.show_results()

# show the eval metrics
pipeline.get_final_report()
```
`custom_dataset.ipynb` contains a complete example of running an evaluation on a custom dataset.
- `eval_name`: name of the evaluation run
- `task_class`: class name from the supported class names
- `task_description`: short description of the task
- `dataset_name`: dataset name for evaluation
- `subset`: subset name, if the dataset has subsets
- `train_split`: train split name in the dataset
- `test_split`: test split name in the dataset
- `input_column_name`: input column name in the dataset
- `target_column_name`: target column name in the dataset
- `prompt`: the prompt fed to the model
- `api_key`: your OpenAI API key
- `preprocessing_fn`: function used to preprocess inputs and targets
- `threads`: number of threads used to call the API
- `threads_timeout`: thread timeout
- `max_samples`: maximum number of samples drawn from the dataset for evaluation
- `model_name`: either `gpt-3.5-turbo-0301` or `gpt-4-0314`
- `temperature`: temperature passed to the model, between 0 and 2; higher values yield more random results
- `num_few_shot`: number of few-shot samples used for evaluation
- `resume_from_record`: if `True`, the run continues from the first sample that has no results
- `seed`: seed to reproduce the results
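A common use of `preprocessing_fn` on AJGT-style data is mapping integer labels to the label strings the model is asked to predict. A minimal sketch, assuming the function receives each dataset record as a dict and returns the modified record (check the library source for the exact contract):

```python
# Hypothetical preprocessing function: map AJGT's integer labels to
# "Negative"/"Positive" strings so the model's text answer can be
# compared against the target directly.
def map_labels(sample):
    label_names = {0: "Negative", 1: "Positive"}
    sample["label"] = label_names[sample["label"]]
    return sample

# Example record shaped like a row of arbml/ajgt_ubc_split
record = {"content": "...", "label": 1}
print(map_labels(record)["label"])  # Positive
```

The function would then be passed as `preprocessing_fn=map_labels` when constructing the pipeline.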
- `Classification`: classification tasks, see `classification.py`
- `Pos_Tagging`: part-of-speech tagging tasks, see `pos_tagging.py`
- `Translation`: machine translation, see `translation.py`
- `Summarization`: summarization, see `summarization.py`
- `MCQ`: multiple-choice question answering, see `mcq.py`
- `Rating`: rating the outputs of multiple LLMs, see `rating.py`
- `Diacritization`: diacritization, see `diacritization.py`
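Every task class plugs into the same `Pipeline` interface, so switching tasks mostly means changing `task_class`, the prompt, and the column names. A hypothetical sketch for a translation run; the dataset name, split names, and column names below are placeholders, not values from this repository:

```python
import taqyim as tq

# Illustrative only: substitute a real Hugging Face dataset and its
# actual split/column names before running.
pipeline = tq.Pipeline(
    eval_name="translation-test",
    dataset_name="<hf-translation-dataset>",
    task_class="translation",
    task_description="Machine Translation",
    input_column_name="<source-column>",
    target_column_name="<target-column>",
    prompt="Translate the following sentence to English",
    api_key="<openai-key>",
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=1,
)
pipeline.run()
```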
Task | Dataset | Size | Metrics | GPT-3.5 | GPT-4 | SoTA |
---|---|---|---|---|---|---|
Summarization | EASC | 153 | RougeL | 23.5 | 18.25 | 13.3 |
PoS Tagging | PADT | 680 | Accuracy | 75.91 | 86.29 | 96.83 |
Classification | AJGT | 360 | Accuracy | 86.94 | 90.30 | 96.11 |
Transliteration | BOLT Egyptian✢ | 6,653 | BLEU | 13.76 | 27.66 | 65.88 |
Translation | UN v1 | 4,000 | BLEU | 35.05 | 38.83 | 53.29 |
Paraphrasing | APB | 1,010 | BLEU | 4.295 | 6.104 | 17.52 |
Diacritization | WikiNews✢✢ | 393 | WER/DER | 32.74/10.29 | 38.06/11.64 | 4.49/1.21 |
✢ BOLT requires an LDC subscription.

✢✢ WikiNews is not public; contact the authors for access to the dataset.
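The Diacritization row reports WER/DER (word and diacritic error rates). As a reference point for reading those numbers, WER is the word-level edit distance between hypothesis and reference, normalized by the reference length; a minimal sketch (not the library's implementation):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

# one substitution over three reference words
print(wer("the cat sat", "the cat sits"))
```

DER is computed analogously at the level of individual diacritic marks.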
```bibtex
@misc{alyafeai2023taqyim,
    title={Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models},
    author={Zaid Alyafeai and Maged S. Alshaibani and Badr AlKhamissi and Hamzah Luqman and Ebrahim Alareqi and Ali Fadel},
    year={2023},
    eprint={2306.16322},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```