Code to reproduce the experiments, figures, and tables of the paper "Training on the Test Task Confounds Evaluation and Emergence".
- The folder experiments/ contains the code to fine-tune models on the task-relevant datasets considered in the paper, and to evaluate models using the LM Evaluation Harness library.
- The folder notebooks/evaluations contains the model evaluation files.
- The Jupyter notebook notebooks/figures.ipynb reproduces the figures and tables in the paper.
- The fine-tuned models are currently being uploaded here.
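
As a rough illustration of the evaluation step mentioned above, the sketch below assembles an `lm_eval` command line for the LM Evaluation Harness (assuming version 0.4 or later, which provides the `lm_eval` CLI). It is not taken from experiments/: the helper name `build_lm_eval_command`, the model path, the task names, and the output directory are all placeholders chosen for illustration. The helper only builds the command; it does not run it.

```python
import shlex


def build_lm_eval_command(model_path, tasks, num_fewshot=0, output_path="results"):
    """Return the argument list for one `lm_eval` run.

    Hypothetical helper, not part of this repository. The flags used here
    (--model, --model_args, --tasks, --num_fewshot, --output_path) are those
    of the LM Evaluation Harness CLI, assuming version 0.4 or later.
    """
    return [
        "lm_eval",
        "--model", "hf",                              # Hugging Face backend
        "--model_args", f"pretrained={model_path}",   # local path or hub ID
        "--tasks", ",".join(tasks),                   # comma-separated task names
        "--num_fewshot", str(num_fewshot),
        "--output_path", output_path,
    ]


if __name__ == "__main__":
    # Placeholder model path and task names; substitute your own.
    cmd = build_lm_eval_command(
        "path/to/fine-tuned-model", ["mmlu", "gsm8k"], num_fewshot=5
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the harness is installed. The scripts in experiments/ may use different options, so treat this only as a starting point.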