ZeroSumEval is a framework for evaluating the reasoning abilities of Large Language Models (LLMs) through zero-sum multiplayer simulations.

ZeroSumEval aims to provide a robust evaluation framework for LLMs built around competitive scenarios. Instead of relying on fixed benchmarks or model-based judging, it uses multiplayer simulations and games with clear win conditions to pit models against each other. The framework tests a range of model capabilities, including knowledge, reasoning, and planning. In addition, ZeroSumEval uses DSPy prompt optimization both to test models' capacity for self-improvement and to keep the competition between models fair.
The eval suite consists of a growing number of simulations, including text-based challenges, board games, and Capture The Flag (CTF) competitions.
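To give a concrete feel for the DSPy integration, the sketch below defines a minimal move-generation signature and wraps it in a predictor. The class, field names, and docstring are illustrative assumptions, not ZeroSumEval's actual modules:

```python
import dspy

# Hypothetical signature for a chess-playing module; ZeroSumEval's actual
# signatures and module names may differ.
class MakeMove(dspy.Signature):
    """Given the current board position, produce a legal next move."""
    board_state = dspy.InputField(desc="current position as a FEN string")
    move = dspy.OutputField(desc="next move in SAN, e.g. 'Nf3'")

# dspy.Predict builds a prompted module from the signature; DSPy optimizers
# such as MIPROv2 can then tune its prompt against a metric and dataset.
make_move = dspy.Predict(MakeMove)
```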
Key features:
- One-click evals on the existing suite of games
- Easily extendable abstractions for new game implementations (see the sketch after this list)
- Integration with DSPy for automated prompt optimization
- Comprehensive logging and analysis tools
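The toy sketch below shows the general shape a new game implementation might take. The base-class interface (`valid_moves`, `apply_move`, `winner`) is a hypothetical stand-in for the framework's real abstractions in `zero_sum_eval/`, which may differ:

```python
from abc import ABC, abstractmethod

# Illustrative interface only -- not ZeroSumEval's actual base class.
class GameState(ABC):
    @abstractmethod
    def valid_moves(self) -> list[str]: ...
    @abstractmethod
    def apply_move(self, move: str) -> "GameState": ...
    @abstractmethod
    def winner(self) -> str | None: ...

class NimState(GameState):
    """Toy game: players alternate taking 1-3 stones; whoever takes the last stone wins."""
    def __init__(self, stones: int = 10, to_move: str = "player_0"):
        self.stones, self.to_move = stones, to_move

    def valid_moves(self) -> list[str]:
        return [str(n) for n in (1, 2, 3) if n <= self.stones]

    def apply_move(self, move: str) -> "NimState":
        nxt = "player_1" if self.to_move == "player_0" else "player_0"
        return NimState(self.stones - int(move), nxt)

    def winner(self) -> str | None:
        # The player who just moved (the one NOT to move) took the last stone.
        if self.stones > 0:
            return None
        return "player_1" if self.to_move == "player_0" else "player_0"
```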
The project is organized as follows:

- `zero_sum_eval/`: Main package containing the core framework
  - `games/`: Individual game implementations
  - `managers/`: Game and match management classes
- `data/`: Game-specific data and examples
- `configs/`: Configuration files for different games and scenarios
- `run_game.py`: Script to run individual games
- `run_matches.py`: Script to run a series of matches
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/ZeroSumEval.git
  cd ZeroSumEval
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
To run a game:

```bash
python run_game.py -c configs/chess.yaml
```

To run a series of matches:

```bash
python run_matches.py -c configs/mathquiz.yaml
```
ZeroSumEval currently supports the following games:
- Chess
- Math Quiz
- Gandalf (Password Guessing)
- PyJail (Capture The Flag)
Each game is implemented as a separate module in the `zero_sum_eval/games/` directory.
Game configurations are defined in YAML files located in the `configs/` directory. These files specify:
- Logging settings
- Game parameters
- Player configurations
- LLM settings
Example configuration (`configs/chess.yaml`):

```yaml
logging:
  output_dir: ../output/chess_game
manager:
  args:
    max_rounds: 200
    win_conditions:
      - Checkmate
    draw_conditions:
      - Stalemate
      - ThreefoldRepetition
      - FiftyMoveRule
      - InsufficientMaterial
game:
  name: chess
  players:
    - name: chess_player
      args:
        id: gpt4 white
        roles:
          - White
        optimize: false
        dataset: chess_dataset
        dataset_args:
          filename: ./data/chess/stockfish_examples.jsonl
          role: White
        optimizer: MIPROv2
        optimizer_args:
          num_candidates: 5
          minibatch_size: 20
          minibatch_full_eval_steps: 10
        compilation_args:
          max_bootstrapped_demos: 1
          max_labeled_demos: 1
        metric: chess_move_validation_metric
        lm:
          type: AzureOpenAI
          args:
            api_base: https://allam-swn-gpt-01.openai.azure.com/
            api_version: 2023-07-01-preview
            deployment_id: gpt-4o-900ptu
            max_tokens: 800
            temperature: 0.8
            top_p: 0.95
            frequency_penalty: 0
            presence_penalty: 0
        max_tries: 5
    - name: chess_player
      args:
        id: gpt4 black
        roles:
          - Black
        optimize: false
        dataset: chess_dataset
        dataset_args:
          filename: ./data/chess/stockfish_examples.jsonl
          role: Black
        optimizer: MIPROv2
        optimizer_args:
          num_candidates: 5
          minibatch_size: 20
          minibatch_full_eval_steps: 10
        compilation_args:
          max_bootstrapped_demos: 1
          max_labeled_demos: 1
        metric: chess_move_validation_metric
        lm:
          type: AzureOpenAI
          args:
            api_base: https://allam-swn-gpt-01.openai.azure.com/
            api_version: 2023-07-01-preview
            deployment_id: gpt-4o-900ptu
            max_tokens: 800
            temperature: 0.8
            top_p: 0.95
            frequency_penalty: 0
            presence_penalty: 0
        max_tries: 5
```
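Because configurations are plain YAML, variants are easy to script. The snippet below is an illustrative workflow (not a framework API): it loads the example config with PyYAML, lowers the sampling temperature for both players, and writes the result to a new file that can be passed to `run_game.py`. The output filename is hypothetical:

```python
import yaml  # PyYAML (pip install pyyaml)

# Load the example chess config and derive a lower-temperature variant.
with open("configs/chess.yaml") as f:
    config = yaml.safe_load(f)

config["manager"]["args"]["max_rounds"] = 100
for player in config["game"]["players"]:
    player["args"]["lm"]["args"]["temperature"] = 0.2

# Write the variant; run it with: python run_game.py -c configs/chess_low_temp.yaml
with open("configs/chess_low_temp.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```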
Contributions to ZeroSumEval are welcome! Please open a pull request.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.