We evaluated three open-source agents. We slightly modified each agent to elicit better capabilities. We also developed a "dummy" agent, used to check that the environment is configured correctly. Each agent is listed below, along with a link to our fork and the ID we use to identify it within our repo.
| Agent | ID | Fork |
| --- | --- | --- |
| dummy | `dummy` | N/A |
| AIDE | `aide` | https://github.com/thesofakillers/aideml |
| MLAgentBench | `mlagentbench` | https://github.com/JunShern/MLAgentBench |
| OpenHands | `opendevin` | https://github.com/thesofakillers/OpenHands |
If you want to run these agents locally:
- Install Docker
- Install Sysbox (see the Security section below for more information)
- (Optional) Install the NVIDIA Container Toolkit to run agents with GPUs
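To sanity-check the prerequisites, you can run something like the following (a rough sketch; exact output varies by Docker and Sysbox version, and the GPU check assumes the NVIDIA Container Toolkit is configured as a Docker runtime):

```bash
# Docker should list the Sysbox runtime (typically registered as "sysbox-runc")
docker info | grep -i runtimes

# (Optional) with the NVIDIA Container Toolkit set up, GPUs should be visible inside containers
docker run --rm --gpus all ubuntu nvidia-smi
```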
To build an image for an agent with ID `<agent>`, run:
```bash
export SUBMISSION_DIR=/home/submission
export LOGS_DIR=/home/logs
export CODE_DIR=/home/code
export AGENT_DIR=/home/agent

docker build --platform=linux/amd64 -t <agent> agents/<agent>/ \
  --build-arg SUBMISSION_DIR=$SUBMISSION_DIR \
  --build-arg LOGS_DIR=$LOGS_DIR \
  --build-arg CODE_DIR=$CODE_DIR \
  --build-arg AGENT_DIR=$AGENT_DIR
```
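For example, using the `aide` agent ID from the table above (and assuming the corresponding `agents/aide/` directory exists, per the `agents/<agent>/` pattern):

```bash
# Build the AIDE agent image
docker build --platform=linux/amd64 -t aide agents/aide/ \
  --build-arg SUBMISSION_DIR=$SUBMISSION_DIR \
  --build-arg LOGS_DIR=$LOGS_DIR \
  --build-arg CODE_DIR=$CODE_DIR \
  --build-arg AGENT_DIR=$AGENT_DIR
```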
Our `run_agent.py` script allows you to run agents locally on a given set of competitions. The `experiments/splits/` directory contains several files, each listing a set of competition IDs: `experiments/splits/all.txt` contains all competitions, while `experiments/splits/spaceship-titanic.txt` contains just the Spaceship Titanic competition, which is useful for testing. For example, to run the dummy agent on the Spaceship Titanic competition, you can run:
```bash
python run_agent.py --agent-id dummy --competition-set experiments/splits/spaceship-titanic.txt
```
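The same flags work for any agent/split combination; for instance, to run AIDE across every competition in the benchmark:

```bash
# Run the AIDE agent on all competitions (expect this to take much longer than the test split)
python run_agent.py --agent-id aide --competition-set experiments/splits/all.txt
```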
Running `run_agent.py` creates a "run group" directory in the `runs/` directory. The run group directory contains a subdirectory for each competition the agent was evaluated on, holding the agent's logs, code, and submission. On completion, a `metadata.json` file summarizing the results of the runs is created within the run group directory. You can then grade the run using this `metadata.json` file. For example, to grade the run group `<run-group>`, first use `experiments/make_submission.py` to generate a submission JSONL file:
```bash
python experiments/make_submission.py --metadata runs/<run-group>/metadata.json --output runs/<run-group>/submission.jsonl
```
You can then use the `mlebench grade` command to grade this submission:
```bash
mlebench grade --submission runs/<run-group>/submission.jsonl --output-dir runs/<run-group>
```
If you'd like to update the configuration of the container, you can edit the default container config in `environment/config/container_configs/default.json`, or specify a custom container config JSON file when executing `run_agent.py`. If you'd like to run the agent with a GPU, you can set `"gpus": -1` in the container config JSON file.
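For example, a custom container config enabling all GPUs might look like the following (a minimal sketch; any other fields should mirror those in `default.json`):

```json
{
  "gpus": -1
}
```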
We used the dummy agent to verify that our environment was configured correctly. It performs the following checks:
- Prints to stdout whether its Python script `main.py` is running with root access
- Prints to stdout the Python interpreter the dummy agent is using
- Attempts to use the sample submission of the current competition as its submission. Note that this sometimes fails, since sample submissions are not always in the expected format; for example, some competitions compress the sample submission along with the other data into a single file
- Checks that it can't read the "private" data, which includes the labels of the test set
- Checks that it has read/write access to `/home`, its working directory
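Concretely, the checks amount to something like the following shell sketch (illustrative only; the actual logic lives in the dummy agent's `main.py`, and the data file paths here are hypothetical):

```bash
# 1. Root access?
[ "$(id -u)" -eq 0 ] && echo "running as root" || echo "running unprivileged"

# 2. Which Python interpreter is being used?
command -v python

# 3. Try the sample submission as-is (may fail if it isn't in the expected format)
cp sample_submission.csv "$SUBMISSION_DIR/submission.csv" \
  || echo "sample submission not usable as a submission"

# 4. Private data (e.g. test-set labels) must NOT be readable (hypothetical path)
cat /private/test_labels.csv 2>/dev/null && echo "FAIL: private data is readable"

# 5. Read/write access to the working directory /home
touch /home/.write_test && rm /home/.write_test && echo "read/write to /home OK"
```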
By default, we run AIDE, MLAgentBench, and the dummy agent using the Sysbox runtime. We recommend that users use this runtime, or another runtime with sufficient security guarantees, when running models on this benchmark. Since container runtimes like Docker can share host resources with containers, a malicious container could potentially compromise the host; Sysbox is designed to mitigate this risk by providing enhanced isolation between containers and the host system.
OpenHands uses its own Docker containers during execution, resulting in Docker-in-Docker (DinD). The Sysbox runtime does not yet support GPU passthrough in DinD (link). We were therefore required to run OpenHands in "privileged" mode, which we deemed acceptable given that our (internal) evaluation infrastructure does not rely on Docker's container-to-host isolation or the virtual machine boundary for security. We want users to acknowledge this security risk, so we require users to set the `I_ACCEPT_RUNNING_PRIVILEGED_CONTAINERS` environment variable to `True` when running OpenHands through `run_agent.py`.
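For example, using the `opendevin` agent ID from the table above:

```bash
# Acknowledge the privileged-container risk, then run OpenHands on the test split
export I_ACCEPT_RUNNING_PRIVILEGED_CONTAINERS=True
python run_agent.py --agent-id opendevin --competition-set experiments/splits/spaceship-titanic.txt
```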