Polish4 temp #11

Open

wants to merge 257 commits into base: polish3

257 commits
1cb35cc
fixes
djstrong Feb 6, 2024
55f274b
fix: change the cbd_mc to be CATEGORIES-based
kacpermilan Feb 6, 2024
35af374
fix: typo in cbd_mc.yaml
kacpermilan Feb 6, 2024
9540f16
fix: typo in cbd_mc.yaml
kacpermilan Feb 6, 2024
d3d7d01
update polish groups
djstrong Feb 27, 2024
c4679ce
fix regex tasks; add benchmark groups
djstrong Mar 1, 2024
7fc327d
fix stderr aggregation
djstrong Mar 5, 2024
e14e593
add perplexity task
djstrong Mar 10, 2024
bc61568
belebele mc
djstrong Mar 10, 2024
85eb77f
Update task_guide.md (#1316)
djstrong Jan 18, 2024
0632a05
Update polemo2_in.yaml (#1318)
lintangsutawika Jan 19, 2024
bb879de
don't pass extra kwargs to mamba any more (#1328)
haileyschoelkopf Jan 22, 2024
6414edd
Fix Issue regarding stderr (#1327)
lintangsutawika Jan 22, 2024
66783f6
Add `local-completions` support using OpenAI interface (#1277)
mgoin Jan 22, 2024
f0ba560
fallback to classname when LM doesnt have config (#1334)
nairbv Jan 22, 2024
9dd448b
fix a trailing whitespace that breaks a lint job (#1335)
nairbv Jan 22, 2024
4f263af
skip "benchmarks" in changed_tasks (#1336)
baberabb Jan 23, 2024
0d8d549
Update migrated HF dataset paths (#1332)
haileyschoelkopf Jan 23, 2024
268d252
Don't use `get_task_dict()` in task registration / initialization (#1…
haileyschoelkopf Jan 23, 2024
82e319d
manage default (greedy) gen_kwargs in vllm (#1341)
baberabb Jan 23, 2024
0938c13
modified default gen_kwargs to work better with CLI; changed prompt_l…
baberabb Jan 24, 2024
97361ed
update links to task_guide.md (#1348)
haileyschoelkopf Jan 24, 2024
ca3a895
`Filter` docs not offset by `doc_id` (#1349)
baberabb Jan 25, 2024
2eeaf15
Add FAQ on `lm_eval.tasks.initialize_tasks()` to README (#1330)
haileyschoelkopf Jan 25, 2024
d467d2f
Refix issue regarding stderr (#1357)
thnkinbtfly Jan 26, 2024
f41ac12
Add causalLM OpenVino models (#1290)
NoushNabi Jan 26, 2024
154f5fa
Apply some best practices and guideline recommendations to code (#1363)
LSinev Jan 28, 2024
b43d9d9
serialize callable functions in config (#1367)
baberabb Jan 29, 2024
2b31cfb
delay filter init; remove `*args` (#1369)
baberabb Jan 30, 2024
cdc41c4
Fix unintuitive `--gen_kwargs` behavior (#1329)
haileyschoelkopf Jan 31, 2024
b39e8da
Publish to pypi (#1194)
anjor Jan 31, 2024
0a39c84
Make dependencies compatible with PyPI (#1378)
haileyschoelkopf Jan 31, 2024
b7513d3
Add support for RWKV models with World tokenizer (#1374)
PicoCreator Jan 31, 2024
7d068d2
add bypass metric (#1156)
baberabb Jan 31, 2024
b284735
Expand docs, update CITATION.bib (#1227)
haileyschoelkopf Feb 1, 2024
80c158c
Hf: minor egde cases (#1380)
baberabb Feb 1, 2024
d55e918
Enable override of printed `n-shot` in table (#1379)
haileyschoelkopf Feb 1, 2024
d6b65f1
Faster Task and Group Loading, Allow Recursive Groups (#1321)
lintangsutawika Feb 1, 2024
5810eac
Fix for https://github.com/EleutherAI/lm-evaluation-harness/issues/13…
pminervini Feb 2, 2024
bad70e7
fix on --task list (#1387)
lintangsutawika Feb 2, 2024
09ca8ff
Support for Inf2 optimum class [WIP] (#1364)
michaelfeil Feb 5, 2024
590bcc7
Update README.md (#1398)
mycoalchen Feb 6, 2024
4ed48ca
Fix confusing `write_out.py` instructions in README (#1371)
haileyschoelkopf Feb 6, 2024
77b79a0
Use Pooled rather than Combined Variance for calculating stderr of ta…
haileyschoelkopf Feb 6, 2024
ca8c608
adding hf_transfer (#1400)
michaelfeil Feb 6, 2024
79378a8
`batch_size` with `auto` defaults to 1 if `No executable batch size f…
pminervini Feb 7, 2024
a04bf2b
use reversed task hierarchy for print (#1414)
haileyschoelkopf Feb 9, 2024
f2220c7
Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1416…
pminervini Feb 10, 2024
5b0db7a
Fix watchdog timeout (#1404)
JeevanBhoot Feb 10, 2024
8d82b49
Evaluate (#1385)
baberabb Feb 11, 2024
80e0a4f
Add multilingual ARC task (#1419)
uanu2002 Feb 11, 2024
5c1b249
Add multilingual TruthfulQA task (#1420)
uanu2002 Feb 11, 2024
66e9620
[m_mmul] added multilingual evaluation from alexandrainst/m_mmlu (#1358)
giux78 Feb 12, 2024
c2c361c
Added seeds to `evaluator.simple_evaluate` signature (#1412)
Am1n3e Feb 12, 2024
af3ca77
Fix: task weighting by subtask size ; update Pooled Stderr formula sl…
haileyschoelkopf Feb 13, 2024
205c870
Refactor utilities into a separate model utils file. (#1429)
baberabb Feb 14, 2024
71bbba4
Update README.md (#1430)
davidbhoffmann Feb 15, 2024
d027702
improve hf_hub activation (#1438)
michaelfeil Feb 18, 2024
8315c1f
Correct typo in task name (#1443)
larekrow Feb 19, 2024
ba89cd6
update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zero…
thnkinbtfly Feb 19, 2024
f3e993d
Add a new task HaeRae-Bench (#1445)
h-albert-lee Feb 20, 2024
44254b3
Group reqs by context (#1425)
baberabb Feb 20, 2024
c51d0ce
Add a new task GPQA (the part without CoT) (#1434)
uanu2002 Feb 20, 2024
7dc04ed
Added KMMLU evaluation method and changed ReadMe (#1447)
h-albert-lee Feb 21, 2024
fbd9bf6
Add TemplateLM boilerplate LM class (#1279)
anjor Feb 22, 2024
0d1af67
Log which subtasks were called with which groups (#1456)
haileyschoelkopf Feb 22, 2024
b8bee2c
PR fixing the issue #1391 (wrong contexts in the mgsm task) (#1440)
leocnj Feb 22, 2024
cf1577a
feat: Add Weights and Biases support (#1339)
ayulockin Feb 22, 2024
dd5bee9
Fixed generation args issue affection OpenAI completion model (#1458)
Am1n3e Feb 22, 2024
be5a419
update parsing logic of mgsm following gsm8k (#1462)
thnkinbtfly Feb 23, 2024
4024ebb
Adding documentation for Weights and Biases CLI interface (#1466)
veekaybee Feb 23, 2024
8a4827a
Add environment and transformers version logging in results dump (#1464)
LSinev Feb 24, 2024
72d40c9
Apply code autoformatting with Ruff to tasks/*.py an *__init__.py (#1…
LSinev Feb 26, 2024
053cf56
setting trust_remote_code (#1467)
veekaybee Feb 26, 2024
e112b37
add arabic mmlu (#1402)
khalil-Hennara Feb 26, 2024
420556e
Add Gemma support (Add flag to control BOS token usage) (#1465)
haileyschoelkopf Feb 26, 2024
06a4347
Revert "setting trust_remote_code (#1467)" (#1474)
haileyschoelkopf Feb 26, 2024
af2d9f6
Create a means for caching task registration and request building. Ad…
inf3rnus Feb 26, 2024
9600d59
Cont metrics (#1475)
lintangsutawika Feb 26, 2024
7fe8dcb
Refactor `evaluater.evaluate` (#1441)
baberabb Feb 27, 2024
77ffeef
add multilingual mmlu eval (#1484)
jordane95 Feb 27, 2024
6093c0c
update name of val split in truthfulqa multilingual (#1488)
haileyschoelkopf Feb 27, 2024
814f36e
Fix AttributeError in huggingface.py When 'model_type' is Missing (#1…
richwardle Feb 27, 2024
c463825
fix duplicated kwargs in some model init (#1495)
lchu-ibm Feb 28, 2024
47d0899
Add multilingual truthfulqa targets (#1499)
jordane95 Mar 1, 2024
0413dee
always include EOS token in stopsequences if possible (#1480)
haileyschoelkopf Mar 1, 2024
d579c8b
Improve data-parallel request partitioning for VLLM (#1477)
haileyschoelkopf Mar 1, 2024
8146103
modify `WandbLogger` to accept arbitrary kwargs (#1491)
baberabb Mar 1, 2024
30141ce
Vllm update DP+TP (#1508)
baberabb Mar 3, 2024
706e10b
Setting trust_remote_code to True for HuggingFace datasets compatibil…
veekaybee Mar 3, 2024
40b0917
Cleaning up unused unit tests (#1516)
veekaybee Mar 4, 2024
4f19431
French Bench (#1500)
ManuelFay Mar 4, 2024
512de72
Hotfix: fix TypeError in `--trust_remote_code` (#1517)
haileyschoelkopf Mar 4, 2024
b915040
Fix minor edge cases (#951 #1503) (#1520)
haileyschoelkopf Mar 4, 2024
2c652b5
Openllm benchmark (#1526)
baberabb Mar 5, 2024
175bc29
Add a new task GPQA (the part CoT and generative) (#1482)
uanu2002 Mar 5, 2024
5c8105c
Add EQ-Bench as per #1459 (#1511)
pbevan1 Mar 6, 2024
44f9421
Add WMDP Multiple-choice (#1534)
justinphan3110 Mar 6, 2024
c9f39fa
Adding new task : KorMedMCQA (#1530)
sean0042 Mar 6, 2024
7aedaf9
Update docs on LM.loglikelihood_rolling abstract method (#1532)
haileyschoelkopf Mar 6, 2024
8c1c093
update printed num-fewshot ; prevent fewshots from erroneously being …
haileyschoelkopf Mar 6, 2024
f238713
Cleanup and fixes (Task, Instance, and a little bit of *evaluate) (#1…
LSinev Mar 6, 2024
3b419af
Update installation commands in openai_completions.py and contributin…
naem1023 Mar 6, 2024
6997af7
Add compatibility for vLLM's new Logprob object (#1549)
Yard1 Mar 9, 2024
74d9a95
Fix incorrect `max_gen_toks` generation kwarg default in code2_text. …
cosmo3769 Mar 9, 2024
8d5e277
Support jinja templating for task descriptions (#1553)
HishamAlyahya Mar 10, 2024
7ffd0d1
Update generate_until_template_yaml (#1546)
haileyschoelkopf Mar 11, 2024
58cda52
Update ifeval.yaml (#1506)
haileyschoelkopf Mar 11, 2024
1858b54
add Arabic EXAMS benchmark (#1498)
khalil-Hennara Mar 11, 2024
5298fc0
AGIEval (#1359)
haileyschoelkopf Mar 11, 2024
94f7159
cli_evaluate calls simple_evaluate with the same verbosity. (#1563)
Wongboo Mar 12, 2024
ee0e166
add manual tqdm disabling management (#1569)
artemorloff Mar 13, 2024
28e568d
Fix README section on vllm integration (#1579)
eitanturok Mar 15, 2024
df6ee7a
Fix Jinja template for Advanced AI Risk (#1587)
RylanSchaeffer Mar 15, 2024
c6edcdb
Proposed approach for testing CLI arg parsing (#1566)
veekaybee Mar 17, 2024
0dc609d
Patch for Seq2Seq Model predictions (#1584)
lintangsutawika Mar 17, 2024
baa917f
Add start date in results.json (#1592)
djstrong Mar 17, 2024
53c11f7
Cleanup for v0.4.2 release (#1573)
haileyschoelkopf Mar 18, 2024
6e52d16
Fix eval_logger import for mmlu/_generate_configs.py (#1593)
noufmitla Mar 18, 2024
8cd155f
use BOS token in loglikelihood (#1588)
djstrong Mar 18, 2024
1ea55eb
Revert "Patch for Seq2Seq Model predictions (#1584)" (#1601)
haileyschoelkopf Mar 19, 2024
5a304c9
fix gen_kwargs arg reading (#1607)
artemorloff Mar 19, 2024
39a0b3a
fix until arg processing (#1608)
artemorloff Mar 19, 2024
a513931
Fixes to Loglikelihood prefix token / VLLM (#1611)
haileyschoelkopf Mar 20, 2024
7d8eeba
Add ACLUE task (#1614)
haonan-li Mar 21, 2024
45ed815
OpenAI Completions -- fix passing of unexpected 'until' arg (#1612)
haileyschoelkopf Mar 21, 2024
9064d35
add logging of model args (#1619)
baberabb Mar 22, 2024
7c7e4fd
Add vLLM FAQs to README (#1625) (#1633)
haileyschoelkopf Mar 25, 2024
f970123
peft Version Assertion (#1635)
LameloBally Mar 25, 2024
048c0d3
Seq2seq fix (#1604)
lintangsutawika Mar 25, 2024
9f50796
Integration of NeMo models into LM Evaluation Harness library (#1598)
sergiopperez Mar 26, 2024
f0b04a0
Fix conditional import for Nemo LM class (#1641)
haileyschoelkopf Mar 27, 2024
fa2acde
Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring …
orsharir Mar 28, 2024
b948d14
Add Latxa paper evaluation tasks for Basque (#1654)
juletx Apr 1, 2024
da93b8a
Fix CLI --batch_size arg for openai-completions/local-completions (#1…
mgoin Apr 1, 2024
cf10ee7
Patch QQP prompt (#1661)
haileyschoelkopf Apr 4, 2024
76a7c23
TMMLU+ implementation (#1394)
ZoneTwelve Apr 5, 2024
6786e82
Anthropic Chat API (#1594)
tryumanshow Apr 5, 2024
98693bf
correction bug EleutherAI#1664 (#1670)
nicho2 Apr 7, 2024
c374e6f
Update README.md (#1680)
haileyschoelkopf Apr 8, 2024
8518800
Add delta weights model loading (#1712)
KonradSzafer Apr 16, 2024
8103925
Add `neuralmagic` models for `sparseml` and `deepsparse` (#1674)
mgoin Apr 16, 2024
a56bf85
fix error when appending eot_token_id for generate_until tasks (#1699)
sergiopperez Apr 18, 2024
a09b018
Adding retries and rate limit to toxicity tasks (#1620)
sator-labs Apr 18, 2024
6687de7
reference `--tasks list` in README (#1726)
nairbv Apr 25, 2024
fe92e5a
Add XNLIeu: a dataset for cross-lingual NLI in Basque (#1694)
juletx Apr 25, 2024
d69d54d
Fix Parameter Propagation for Tasks that have `include` (#1749)
lintangsutawika Apr 25, 2024
f38e8a1
Support individual scrolls datasets (#1740)
giorgossideris Apr 26, 2024
7cd59dd
Add filter registry decorator (#1750)
lozhn Apr 26, 2024
dabce43
remove duplicated `num_fewshot: 0` (#1769)
chujiezheng May 1, 2024
f4281a4
Pile 10k new task (#1758)
mukobi May 1, 2024
c51925d
Fix m_arc choices (#1760)
jordane95 May 1, 2024
e2bc623
upload new tasks (#1728)
simran-arora May 1, 2024
df05e78
vllm lora support (#1756)
bcicc May 2, 2024
af14500
Add option to set OpenVINO config (#1730)
helena-intel May 2, 2024
ba53c71
evaluation tracker implementation (#1766)
KonradSzafer May 3, 2024
da3067f
eval tracker args fix (#1777)
KonradSzafer May 3, 2024
ffc6594
limit fix (#1785)
KonradSzafer May 5, 2024
d261c2f
remove echo parameter in OpenAI completions API (#1779)
djstrong May 5, 2024
29812e7
Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args` (#1776)
MuhammadBinUsman03 May 5, 2024
45c5f41
Fix bug in setting until kwarg in openai completions (#1784)
ciaranby May 5, 2024
615b2dd
Provide ability for custom sampler for ConfigurableTask (#1616)
LSinev May 6, 2024
59c553a
Update `--tasks list` option in interface documentation (#1792)
sepiatone May 6, 2024
4e63a32
Fix Caching Tests ; Remove `pretrained=gpt2` default (#1775)
haileyschoelkopf May 7, 2024
ea773e4
link to the example output on the hub (#1798)
KonradSzafer May 7, 2024
aa4e118
Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt…
haileyschoelkopf May 7, 2024
b3e8661
Logging Updates (Alphabetize table printouts, fix eval tracker bug) (…
haileyschoelkopf May 7, 2024
c864ea2
Initial integration of the Unitxt to LM eval harness (#1615)
yoavkatz May 7, 2024
bba2bf6
add task for mmlu evaluation in arc multiple choice format (#1745)
jonabur May 8, 2024
a137c3e
Update flag `--hf_hub_log_args` in interface documentation (#1806)
sepiatone May 8, 2024
cd0b2ba
Copal task (#1803)
Erland366 May 9, 2024
6bcb05e
Adding tinyBenchmarks datasets (#1545)
LucWeber May 13, 2024
e888fb6
interface doc update (#1807)
KonradSzafer May 13, 2024
ab46906
Fix links in README guiding to another branch (#1838)
LSinev May 14, 2024
5759d86
Fix: support PEFT/LoRA with added tokens (#1828)
mapmeld May 19, 2024
b542fd9
fixed incorrect check for task type (replace `~` with `not`) (#1865)
zafstojano May 21, 2024
d02eb34
fixed docs typos (#1863)
zafstojano May 21, 2024
21f36dd
Unpin vllm in dependencies (#1874)
edgan8 May 23, 2024
5ca629a
Fix outdated links to the latest links in `docs` (#1876)
oneonlee May 24, 2024
2b93289
[HFLM]Use Accelerate's API to reduce hard-coded CUDA code (#1880)
statelesshz May 24, 2024
e6223c0
Fix `batch_size=auto` for HF Seq2Seq models (#1765) (#1790)
haileyschoelkopf May 24, 2024
8329adb
Fix Brier Score (#1847)
lintangsutawika May 24, 2024
e3ec75f
Fix for bootstrap_iters = 0 case (#1715) (#1789)
haileyschoelkopf May 24, 2024
ee44bf2
add mmlu tasks from pile-t5 (#1710)
lintangsutawika May 24, 2024
83f9d66
Bigbench fix (#1686)
lintangsutawika May 24, 2024
fe6fb1a
Rename `lm_eval.logging -> lm_eval.loggers` (#1858)
haileyschoelkopf May 26, 2024
b69aecc
Updated vllm imports in vllm_causallms.py (#1890)
mgoin May 28, 2024
d177975
[HFLM]Add support for Ascend NPU (#1886)
statelesshz May 30, 2024
bbc1216
`higher_is_better` tickers in output table (#1893)
zafstojano May 30, 2024
ebc3807
Add dataset card when pushing to HF hub (#1898)
KonradSzafer May 31, 2024
105b516
Making hardcoded few shots compatible with the chat template mechanis…
clefourrier May 31, 2024
acc4029
Try to make existing tests run little bit faster (#1905)
LSinev May 31, 2024
e53f271
Fix fewshot seed only set when overriding num_fewshot (#1914)
LSinev Jun 3, 2024
85550b3
Complete task list from pr 1727 (#1901)
anthony-dipofi Jun 3, 2024
0f995d9
Add chat template (#1873)
KonradSzafer Jun 3, 2024
aceb0ce
Multiple Choice Questions and Large Languages Models: A Case Study wi…
maximegmd Jun 5, 2024
55c36de
Modify pre-commit hook to check merge conflicts accidentally committe…
LSinev Jun 5, 2024
2d1ffb9
[add] fld logical formula task (#1931)
MorishT Jun 6, 2024
c63d56a
Add new Lambada translations (#1897)
zafstojano Jun 6, 2024
17fcd25
Implement NoticIA (#1912)
ikergarcia1996 Jun 6, 2024
58264ac
Add The Arabic version of the PICA benchmark (#1917)
khalil-Hennara Jun 7, 2024
66e2c9d
Update siqa.yaml (#1909)
haileyschoelkopf Jun 7, 2024
1865671
Update basque-glue (#1913)
zhabuye Jun 7, 2024
eaf6696
Test output table layout consistency (#1916)
zafstojano Jun 7, 2024
a0c1aeb
polqa
djstrong Mar 22, 2024
4320c18
update polish benchmarks
chrisociepa Jan 17, 2024
ff41506
update polish benchmarks
djstrong Mar 25, 2024
15950dd
Add task definitions: 8tags, dyk, ppc, psc, belebele PL (regex), pole…
chrisociepa Jan 17, 2024
a107ca9
task definitions fixes
djstrong Jan 18, 2024
6b8e7b3
Polish benchmark
djstrong Jan 18, 2024
8568c6e
fix regex tasks; add benchmark groups
djstrong Jan 22, 2024
ca605fd
feat: add polish CBD and KLEJ NER benchmarks
kacpermilan Feb 5, 2024
18e618e
fix regex tasks; add benchmark groups
djstrong Mar 1, 2024
76a4f36
update polish benchmarks
chrisociepa Jan 17, 2024
8f0d25c
update polish benchmarks
djstrong Mar 25, 2024
4972634
feat: add the PoQuAD dataset
kacpermilan May 12, 2024
6c4b0a1
fix: tune the open-book prompt
kacpermilan May 13, 2024
d880314
fix psc regex
djstrong Jun 2, 2024
f876552
fix poquad
djstrong Jun 2, 2024
02dd644
polish eq-bench
djstrong Jun 2, 2024
637afd1
polish eq-bench
djstrong Jun 2, 2024
53039c2
polish eq-bench
djstrong Jun 2, 2024
6d5e657
polish eq-bench
djstrong Jun 2, 2024
38e954a
polish eq-bench
djstrong Jun 2, 2024
88e0034
polish eq-bench
djstrong Jun 2, 2024
458cdc7
polish eq-bench
djstrong Jun 2, 2024
fea7b68
polish eq-bench
djstrong Jun 2, 2024
2693979
polish eq-bench
djstrong Jun 2, 2024
776a4d3
polish eq-bench
djstrong Jun 2, 2024
53f1dfc
polish eq-bench
djstrong Jun 2, 2024
87b2160
polish eq-bench
djstrong Jun 2, 2024
8998552
polish eq-bench
djstrong Jun 2, 2024
b6c4ac3
polish eq-bench
djstrong Jun 2, 2024
f9ca054
polish eq-bench
djstrong Jun 2, 2024
29959a3
polish eq-bench
djstrong Jun 2, 2024
aaaac9d
polish eq-bench
djstrong Jun 2, 2024
db195a2
polish eq-bench
djstrong Jun 3, 2024
d96cd84
polish eq-bench
djstrong Jun 3, 2024
2df424f
polish eq-bench
djstrong Jun 3, 2024
badffa9
polish eq-bench
djstrong Jun 4, 2024
c3a1bec
polish eq-bench
djstrong Jun 8, 2024
1ca2260
fgd
djstrong Jun 8, 2024
ee8f8be
fgd
djstrong Jun 8, 2024
5d61f54
generate until <|im_end|>
djstrong Jun 13, 2024
10a79a2
powuad; pes; hash fix
djstrong Aug 2, 2024
8819b64
fix multiple choice openai
djstrong Aug 2, 2024
45f6010
fix multiple choice openai
djstrong Aug 2, 2024
8974fc2
fix multiple choice openai
djstrong Aug 2, 2024
0bea423
fix belebele
djstrong Aug 13, 2024
21d0ea9
polish pes split
djstrong Aug 22, 2024
78 changes: 78 additions & 0 deletions .github/workflows/publish.yml
@@ -0,0 +1,78 @@
name: Publish Python distribution to PyPI

on:
  push:
    tags:
      - '*'

jobs:
  build:
    name: Build distribution
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.x"

      - name: Install pypa/build
        run: >-
          python3 -m
          pip install
          build
          --user
      - name: Build a binary wheel and a source tarball
        run: python3 -m build
      - name: Store the distribution packages
        uses: actions/upload-artifact@v3
        with:
          name: python-package-distributions
          path: dist/

  publish-to-pypi:
    name: >-
      Publish Python distribution to PyPI
    if: startsWith(github.ref, 'refs/tags/')  # only publish to PyPI on tag pushes
    needs:
      - build
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/lm_eval
    permissions:
      id-token: write  # IMPORTANT: mandatory for trusted publishing

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v3
        with:
          name: python-package-distributions
          path: dist/
      - name: Publish distribution to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1

  publish-to-testpypi:
    name: Publish Python distribution to TestPyPI
    needs:
      - build
    runs-on: ubuntu-latest

    environment:
      name: testpypi
      url: https://test.pypi.org/p/lm_eval

    permissions:
      id-token: write  # IMPORTANT: mandatory for trusted publishing

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v3
        with:
          name: python-package-distributions
          path: dist/
      - name: Publish distribution to TestPyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          repository-url: https://test.pypi.org/legacy/
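
Note: this workflow runs only on tag pushes. As a minimal sketch of how a release would be cut under that assumption (the tag name below is hypothetical):

```bash
# Pushing a tag triggers the build job and, from it, the PyPI publish job
git tag -a v0.4.2 -m "Release v0.4.2"
git push origin v0.4.2
```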
2 changes: 1 addition & 1 deletion .github/workflows/unit_tests.yml
@@ -56,7 +56,7 @@ jobs:
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e '.[dev,anthropic,sentencepiece]' --extra-index-url https://download.pytorch.org/whl/cpu
          pip install -e '.[dev,anthropic,sentencepiece,optimum,deepsparse,sparseml]' --extra-index-url https://download.pytorch.org/whl/cpu
          # Install optional git dependencies
          # pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
          # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
5 changes: 5 additions & 0 deletions .gitignore
@@ -16,3 +16,8 @@ temp
# IPython
profile_default/
ipython_config.py
# don't track (the default location of) the cached requests
lm_eval/caching/.cache
# don't track files created by wandb
wandb
examples/wandb
7 changes: 4 additions & 3 deletions .pre-commit-config.yaml
@@ -2,14 +2,15 @@
exclude: ^tests/testdata/
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.1.0
    rev: v4.5.0
    hooks:
      - id: check-added-large-files
      - id: check-ast
      - id: check-byte-order-marker
      - id: check-case-conflict
      - id: check-json
      - id: check-merge-conflict
        args: [--assume-in-merge]
      - id: check-symlinks
      - id: check-yaml
        args: ["--unsafe"]
@@ -29,7 +30,7 @@ repos:
        args: [--fix=lf]
  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.1.8
    rev: v0.2.2
    hooks:
      # Run the linter.
      - id: ruff
@@ -38,7 +39,7 @@
      # Run the formatter.
      - id: ruff-format
  - repo: https://github.com/codespell-project/codespell
    rev: v2.1.0
    rev: v2.2.6
    hooks:
      - id: codespell
        exclude: >
203 changes: 175 additions & 28 deletions README.md

Large diffs are not rendered by default.

81 changes: 81 additions & 0 deletions docs/CONTRIBUTING.md
@@ -0,0 +1,81 @@
# Contributing to LM Evaluation Harness

Welcome, and thank you for your interest in the LM Evaluation Harness! We welcome contributions and feedback, appreciate the time you spend with our library, and hope you find it useful!

We intend LM Evaluation Harness to be a broadly useful and extensible tool for evaluating language models.

## Important Resources

Information about LM Evaluation Harness is located in several places:

- Our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)
- We occasionally use [GitHub Milestones](https://github.com/EleutherAI/lm-evaluation-harness/milestones) to track progress toward specific near-term version releases.
- We maintain a [Project Board](https://github.com/orgs/EleutherAI/projects/25) for tracking current work items and PRs, and for future roadmap items or feature requests.
- Further discussion and support conversations are located in the #lm-thunderdome channel of the [EleutherAI discord](https://discord.gg/eleutherai).

## Code Style

LM Evaluation Harness uses [ruff](https://github.com/astral-sh/ruff) for linting via [pre-commit](https://pre-commit.com/).

You can install linters and dev tools via

```pip install lm_eval[dev]``` or ```pip install -e ".[dev]"```

Then, run

```pre-commit install```

in order to ensure linters and other checks will be run upon committing.
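
To run the same checks manually across the whole repository (a standard pre-commit invocation, offered here as an optional extra step), you can use:

```
pre-commit run --all-files
```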

## Testing

We use [pytest](https://docs.pytest.org/en/latest/) for running unit tests. All library unit tests can be run via:

```
python -m pytest --ignore=tests/tests_master --ignore=tests/extra
```
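
To iterate on a single test module during development, a narrower invocation along these lines also works (the file path below is illustrative, not a required entry point):

```
python -m pytest tests/test_evaluator.py -v
```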

## Contributor License Agreement

We ask that new contributors agree to a Contributor License Agreement affirming that EleutherAI has the rights to use your contribution to our library.
First-time pull requests will have a reply added by @CLAassistant containing instructions for how to confirm this, and we require it before merging your PR.


## Contribution Best Practices

We recommend a few best practices to make your contributions or reported errors easier to assist with.

**For Pull Requests:**
- PRs should be titled descriptively, and be opened with a brief description of the scope and intent of the new contribution.
- New features should have appropriate documentation added alongside them.
- Aim for code maintainability, and minimize code copying.
- If opening a task, try to share test results on the task using a publicly-available model, and if any public results are available on the task, compare to them.

**For Feature Requests:**
- Provide a short paragraph's worth of description. What is the feature you are requesting? What is its motivation, and what is an example use case? How does it differ from what is currently supported?

**For Bug Reports**:
- Provide a short description of the bug.
- Provide a *reproducible example*--what is the command you run with our library that results in this error? Have you tried any other steps to resolve it?
- Provide a *full error traceback* of the error that occurs, if applicable. A one-line error message or small screenshot snippet is unhelpful without the surrounding context.
- Note what version of the codebase you are using, and any specifics of your environment and setup that may be relevant.

**For Requesting New Tasks**:
- Provide a 1-2 sentence description of what the task is and what it evaluates.
- Provide a link to the paper introducing the task.
- Provide a link to where the dataset can be found.
- Provide a link to a paper containing results on an open-source model on the task, for use in comparisons and implementation validation.
- If applicable, link to any codebase that has implemented the task (especially the original publication's codebase, if existent).

## How Can I Get Involved?

To quickly get started, we maintain a list of good first issues, which can be found [on our project board](https://github.com/orgs/EleutherAI/projects/25/views/8) or by [filtering GH Issues](https://github.com/EleutherAI/lm-evaluation-harness/issues?q=is%3Aopen+label%3A%22good+first+issue%22+label%3A%22help+wanted%22). These are typically smaller code changes or self-contained features which can be added without extensive familiarity with library internals, and we recommend new contributors consider taking a stab at one of these first if they are feeling uncertain where to begin.

There are a number of distinct ways to contribute to LM Evaluation Harness, and all are extremely helpful! A sampling of ways to contribute includes:
- **Implementing and verifying new evaluation tasks**: Is there a task you'd like to see LM Evaluation Harness support? Consider opening an issue requesting it, or helping add it! Verifying and cross-checking task implementations with their original versions is also a very valuable form of assistance in ensuring standardized evaluation.
- **Improving documentation** - Improvements to the documentation, or notes on pain points and gaps in it, help us improve the library's user experience and the clarity and coverage of our docs.
- **Testing and devops** - We are very grateful for any assistance in adding tests for the library that can be run for new PRs, and other devops workflows.
- **Adding new modeling / inference library integrations** - We hope to support a broad range of commonly-used inference libraries popular among the community, and welcome PRs for new integrations, so long as they are documented properly and maintainable.
- **Proposing or Contributing New Features** - We want LM Evaluation Harness to support a broad range of evaluation usecases. If you have a feature that is not currently supported but desired, feel free to open an issue describing the feature and, if applicable, how you intend to implement it. We would be happy to give feedback on the cleanest way to implement new functionalities and are happy to coordinate with interested contributors via GH discussions or via discord.

We hope that this has been helpful, and appreciate your interest in contributing! Further questions can be directed to [our Discord](https://discord.gg/eleutherai).
8 changes: 4 additions & 4 deletions docs/README.md
@@ -4,7 +4,7 @@ Welcome to the docs for the LM Evaluation Harness!

## Table of Contents

* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/interface.md)
* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/task_guide.md).
* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](./interface.md)
* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](./model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](./new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](./task_guide.md).
7 changes: 2 additions & 5 deletions docs/decontamination.md
@@ -2,15 +2,14 @@

## Usage

Simply add a "--decontamination_ngrams_path" when running \__main\__.py. The provided directory should contain
The provided directory should contain
the ngram files and info.json produced in "Pile Ngram Generation" further down.

```bash
python -m lm_eval \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
--tasks sciq
```

## Background
@@ -70,5 +69,3 @@ python -m scripts/clean_training_data/compress_and_package \
-output path/to/final/directory \
-procs 8
```

Congratulations, the final directory can now be passed to lm-evaluation-harness with the "--decontamination_ngrams_path" argument.