📣 [2024/10/25]: We release all 20,000 base risk prompts and 200,000 corresponding attack prompts (Version-0.1.2). We also update 🏆 LeaderBoard_v0.1.2 with new evaluation results including GPT-4 and other models. 🎉 S-Eval has achieved about 7,000 total views and about 2,000 total downloads across multiple platforms. 🎉
📣 [2024/06/17]: We further release 10,000 base risk prompts and 100,000 corresponding attack prompts (Version-0.1.1). If you require automatic safety evaluations, please feel free to submit a request via Issues or contact us by Email.
📣 [2024/05/31]: We release 20,000 corresponding attack prompts.
📣 [2024/05/23]: We publish our paper on arXiv and release the first 2,000 base risk prompts. The evaluation results from our experiments are shown in the HuggingFace 🏆 Leaderboard_v0.1.1. You can also download the benchmark from the HuggingFace Dataset.
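For convenience, below is a minimal sketch of loading the benchmark with the 🤗 `datasets` library. The repository ID and configuration/split names are assumptions for illustration only; please check the dataset card on HuggingFace for the exact identifiers.

```python
# pip install datasets
from datasets import load_dataset

# NOTE: the repository ID and config/split names below are assumptions for
# illustration; consult the HuggingFace dataset card for the exact values.
base_prompts = load_dataset("IS2Lab/S-Eval", "base_risk_set_en", split="train")

print(len(base_prompts))   # number of base risk prompts in this subset
print(base_prompts[0])     # inspect a single prompt record
```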
To keep this benchmark responsive to evolving needs and future evaluation challenges, we warmly welcome submissions of new risks and attacks through this project to continuously augment our risk taxonomy and adversarial scenarios! S-Eval is designed to be a comprehensive, multi-dimensional, and open-ended safety evaluation benchmark. So far, S-Eval contains 220,000 evaluation prompts in total (and is still being actively expanded), including 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts derived from 10 popular adversarial instruction attacks. These test prompts are generated from a comprehensive and unified risk taxonomy, specifically designed to cover all crucial dimensions of LLM safety evaluation and to accurately reflect the varied safety levels of LLMs across these risk dimensions. More details on the construction of the test suite, including model-based test generation, selection, and the expert critique LLM, can be found in our paper.
The following table shows the statistics of the risk dimensions and the number of test prompts per language in S-Eval; the counts are the same for Chinese and English.
| Risk Dimension | Risk Category | # Base | # Attack |
|---|---|---|---|
| Crimes and Illegal Activities (CI) | Pornographic Contraband | 533 | 5330 |
| | Drug Crimes | 432 | 4320 |
| | Dangerous Weapons | 487 | 4870 |
| | Property Infringement | 400 | 4000 |
| | Economic Crimes | 496 | 4960 |
| Cybersecurity (CS) | Access Control | 228 | 2280 |
| | Hacker Attack | 209 | 2090 |
| | Malicious Code | 313 | 3130 |
| | Physical Security | 252 | 2520 |
| Data Privacy (DP) | Personal Privacy | 668 | 6680 |
| | Commercial Secret | 674 | 6740 |
| Ethics and Morality (EM) | Social Ethics | 493 | 4930 |
| | Science Ethics | 507 | 5070 |
| Physical and Mental Health (PM) | Physical Harm | 519 | 5190 |
| | Mental Health | 483 | 4830 |
| Hate Speech (HS) | Abusive Curses | 296 | 2960 |
| | Cyberbullying | 303 | 3030 |
| | Defamation | 292 | 2920 |
| | Threaten and Intimidate | 302 | 3020 |
| Extremism (EX) | Violent Terrorist Activities | 207 | 2070 |
| | Social Disruption | 366 | 3660 |
| | Extremist Ideological Trends | 524 | 5240 |
| Inappropriate Suggestions (IS) | Finance | 341 | 3410 |
| | Medicine | 338 | 3380 |
| | Law | 337 | 3370 |
| Total | - | 10000 | 100000 |
For each method, we calculate balanced accuracy as well as precision and recall for each label (i.e., safe/unsafe). Bold values indicate the best results.
| Method | Chinese ACC | Chinese Precision | Chinese Recall | English ACC | English Precision | English Recall |
|---|---|---|---|---|---|---|
| Rule Matching | 74.12 | 78.46/74.44 | 87.08/61.15 | 70.19 | 69.42/72.01 | 77.54/62.84 |
| GPT-4-Turbo | 78.00 | 79.19/94.07 | 97.74/58.27 | 72.36 | 66.84/**93.83** | 97.12/47.60 |
| LLaMA-Guard-2 | 76.23 | 77.68/**95.37** | **98.38**/57.07 | 69.32 | 64.30/93.81 | **97.50**/41.13 |
| Ours | **92.23** | **93.36**/92.37 | 95.48/**88.98** | **88.23** | **86.36**/90.97 | 92.32/**84.13** |
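For reference, the sketch below shows one way to compute these metrics with scikit-learn. The toy labels are hypothetical and this is not the exact evaluation script used in the paper.

```python
# pip install scikit-learn
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score

# Hypothetical toy labels for illustration: 0 = safe, 1 = unsafe.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Balanced accuracy averages the recall of the two classes.
acc = balanced_accuracy_score(y_true, y_pred)

# Precision and recall are computed per label, matching the
# "safe/unsafe" pairs reported in the table above.
prec_safe = precision_score(y_true, y_pred, pos_label=0)
prec_unsafe = precision_score(y_true, y_pred, pos_label=1)
rec_safe = recall_score(y_true, y_pred, pos_label=0)
rec_unsafe = recall_score(y_true, y_pred, pos_label=1)

print(f"ACC: {acc:.2%}  "
      f"Precision: {prec_safe:.2%}/{prec_unsafe:.2%}  "
      f"Recall: {rec_safe:.2%}/{rec_unsafe:.2%}")
```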
If our work is useful for your own research, please cite us with the following BibTeX entry:
@article{yuan2024seval,
title={S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models},
author={Xiaohan Yuan and Jinfeng Li and Dongxia Wang and Yuefeng Chen and Xiaofeng Mao and Longtao Huang and Hui Xue and Wenhai Wang and Kui Ren and Jingyi Wang},
journal={arXiv preprint arXiv:2405.14191},
year={2024}
}