Inspired by the paper *Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena* (https://arxiv.org/abs/2306.05685).
Uses Qwen-1.5-72B (served through vLLM) as the LLM judge to evaluate models in Hindi, English, and Hinglish.
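A minimal sketch of the single-answer grading loop, assuming vLLM's offline `LLM` API and an MT-Bench-style 1-10 rubric; the prompt text, decoding settings, and `[[rating]]` format here are illustrative stand-ins, not the repo's actual template:

```python
import re
from vllm import LLM, SamplingParams

# Judge model; the AWQ checkpoint needs quantization="awq"
# (a 72B model will typically also need tensor parallelism).
judge = LLM(model="Qwen/Qwen1.5-72B-Chat-AWQ", quantization="awq")
params = SamplingParams(temperature=0.0, max_tokens=512)

# Illustrative MT-Bench-style grading prompt: the judge explains its
# reasoning, then emits a 1-10 rating wrapped as [[rating]].
PROMPT = (
    "Please act as an impartial judge and rate the quality of the "
    "assistant's answer to the question below on a scale of 1 to 10. "
    "After a short explanation, output the rating wrapped as [[rating]].\n\n"
    "[Question]\n{question}\n\n[Answer]\n{answer}\n"
)

def judge_score(question: str, answer: str) -> float | None:
    # A real run should wrap the prompt in the judge's chat template.
    out = judge.generate([PROMPT.format(question=question, answer=answer)], params)
    m = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", out[0].outputs[0].text)
    return float(m.group(1)) if m else None
```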
Evaluation dataset: https://huggingface.co/datasets/manishiitg/human_eval
Final results are published at https://huggingface.co/datasets/manishiitg/llm_judge
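The question set can be pulled straight from the Hub. A quick look, assuming the default `train` split (the actual split and column names should be checked on the dataset card):

```python
from datasets import load_dataset

# Evaluation questions; the split name is an assumption.
questions = load_dataset("manishiitg/human_eval", split="train")
print(questions)     # shows the actual columns and row count
print(questions[0])  # inspect one record
```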
To evaluate your own model, add it to https://github.com/manishiitg/IndicLMJudge/blob/main/scripts/indic_eval/common_vars.sh
and simply run `bash scripts/lmjudge.sh`.
Hindi (hi) results:

Model | Language | Score | No. of Questions |
---|---|---|---|
Qwen/Qwen1.5-72B-Chat-AWQ | hi | 8.3722 | 562 |
Qwen/Qwen1.5-14B-Chat | hi | 8.2561 | 561 |
google/gemma-7b-it | hi | 7.8930 | 561 |
Qwen/Qwen1.5-7B-Chat | hi | 7.8518 | 562 |
manishiitg/open-aditi-hi-v3 | hi | 7.7464 | 562 |
manishiitg/open-aditi-hi-v4 | hi | 7.5537 | 562 |
manishiitg/open-aditi-hi-v2 | hi | 7.2536 | 562 |
teknium/OpenHermes-2.5-Mistral-7B | hi | 7.2240 | 562 |
ai4bharat/Airavata | hi | 6.9355 | 550 |
01-ai/Yi-34B-Chat | hi | 6.5692 | 562 |
manishiitg/open-aditi-hi-v1 | hi | 4.6521 | 562 |
sarvamai/OpenHathi-7B-Hi-v0.1-Base | hi | 4.2417 | 606 |
Qwen/Qwen1.5-4B-Chat | hi | 4.0970 | 562 |
English (en) results:

Model | Language | Score | No. of Questions |
---|---|---|---|
Qwen/Qwen1.5-14B-Chat | en | 9.1956 | 362 |
Qwen/Qwen1.5-72B-Chat-AWQ | en | 9.1577 | 362 |
Qwen/Qwen1.5-7B-Chat | en | 9.1503 | 362 |
01-ai/Yi-34B-Chat | en | 9.1373 | 362 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | en | 9.1340 | 362 |
teknium/OpenHermes-2.5-Mistral-7B | en | 9.0006 | 362 |
manishiitg/open-aditi-hi-v3 | en | 8.9069 | 362 |
manishiitg/open-aditi-hi-v4 | en | 8.9064 | 362 |
google/gemma-7b-it | en | 8.7945 | 362 |
Qwen/Qwen1.5-4B-Chat | en | 8.7224 | 362 |
manishiitg/open-aditi-hi-v2 | en | 8.4343 | 362 |
ai4bharat/Airavata | en | 7.3923 | 362 |
manishiitg/open-aditi-hi-v1 | en | 6.6413 | 361 |
sarvamai/OpenHathi-7B-Hi-v0.1-Base | en | 5.9009 | 318 |
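The leaderboard numbers above are averages over the per-question judge ratings in the published results dataset. A sketch of how to recompute them, assuming the dataset exposes `model`, `language`, and `rating` columns (these names are a guess; verify against the dataset card):

```python
from collections import defaultdict
from datasets import load_dataset

# Published per-question judgments; split and column names are assumptions.
results = load_dataset("manishiitg/llm_judge", split="train")

buckets = defaultdict(list)
for row in results:
    buckets[(row["model"], row["language"])].append(row["rating"])

# Print "model | language | mean score | question count", highest mean first.
for (model, lang), ratings in sorted(
    buckets.items(), key=lambda kv: -sum(kv[1]) / len(kv[1])
):
    print(f"{model} | {lang} | {sum(ratings) / len(ratings):.4f} | {len(ratings)}")
```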