v0.5.0
Breaking changes
- The `--run-specs` flag was renamed to `--run-entries` (#2404)
- The `run_specs*.conf` files were renamed to `run_entries*.conf` (#2430)
- The `model_metadata` field was removed from `schema*.yaml` files (#2195)
- The `helm.proxy.clients` package was moved to `helm.clients` (#2413)
- The `helm.proxy.tokenizers` package was moved to `helm.tokenizers` (#2403)
- The frontend only supports JSON output produced by `helm-summarize` at version 0.3.0 or newer (#2455)
- The `Sequence` class was renamed to `GeneratedOutput` (#2551)
- The `black` linter was upgraded from 22.10.0 to 24.3.0, which produces different output - run `pip install --upgrade black==24.3.0` to upgrade this dependency (#2545)
- The `anthropic` dependency was upgraded from `anthropic~=0.2.5` to `anthropic~=0.17` - run `pip install --upgrade anthropic~=0.17` to upgrade this dependency (#2432)
- The `openai` dependency was upgraded from `openai~=0.27.8` to `openai~=1.0` - run `pip install --upgrade openai~=1.0` to upgrade this dependency (#2384)
  - The SQLite cache is not compatible across this dependency upgrade - if you encounter a `ModuleNotFoundError: No module named 'openai.openai_object'` error after upgrading `openai`, you will have to delete your old OpenAI SQLite cache (e.g. by running `rm prod_env/cache/openai.sqlite`)
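The flag, file, and package renames under Breaking changes can be applied mechanically to local scripts and configuration. The following is a minimal sketch of one way to do that, not an official migration tool: the helper names (`migrate_text`, `migrated_conf_name`, `migrate_tree`) and the choice of file extensions to scan are assumptions made for illustration.

```python
import re
from pathlib import Path
from typing import List

# Old -> new names, taken from the breaking changes in this release.
RENAMES = {
    "--run-specs": "--run-entries",
    "helm.proxy.clients": "helm.clients",
    "helm.proxy.tokenizers": "helm.tokenizers",
}

# Hypothetical choice: only these file types are scanned.
SCANNED_SUFFIXES = {".py", ".sh", ".conf"}


def migrate_text(text: str) -> str:
    """Replace pre-0.5.0 flag and package names with their 0.5.0 names."""
    for old, new in RENAMES.items():
        text = text.replace(old, new)
    return text


def migrated_conf_name(filename: str) -> str:
    """Map run_specs*.conf to run_entries*.conf; leave other names unchanged."""
    return re.sub(r"^run_specs(.*\.conf)$", r"run_entries\1", filename)


def migrate_tree(root: Path, dry_run: bool = True) -> List[str]:
    """List the edits needed under root and, unless dry_run is set, apply them."""
    actions = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SCANNED_SUFFIXES:
            continue
        old_text = path.read_text(encoding="utf-8")
        new_text = migrate_text(old_text)
        if new_text != old_text:
            actions.append(f"rewrite {path}")
            if not dry_run:
                path.write_text(new_text, encoding="utf-8")
        new_name = migrated_conf_name(path.name)
        if new_name != path.name:
            target = path.with_name(new_name)
            actions.append(f"rename {path} -> {target}")
            if not dry_run:
                path.rename(target)
    return actions
```

Calling `migrate_tree(Path("."))` with the default `dry_run=True` only reports what would change, which makes it easy to review before applying. The `Sequence` to `GeneratedOutput` rename is left out on purpose, since a blind text replacement would also hit unrelated names such as `typing.Sequence`.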
Scenarios
- Added DecodingTrust (#1827)
- Added Hateful Memes (#1992)
- Added MMMU (#2259)
- Added Image2Structure (#2267, #2472)
- Added MMU (#2259)
- Added LMEntry (#1694)
- Added Unicorn vision-language scenario (#2456)
- Added Bingo vision-language scenario (#2456)
- Added MultipanelVQA (#2517)
- Added POPE (#2517)
- Added MultiMedQA (#2524)
- Added ThaiExam (#2534)
- Added Seed-Bench and MME (#2559)
- Added Mementos vision-language scenario (#2555)
- Added Unitxt integration (#2442, #2553)
Models
- Added OpenAI gpt-3.5-turbo-1106, gpt-3.5-turbo-0125, gpt-4-vision-preview, gpt-4-0125-preview, and gpt-3.5-turbo-instruct (#2189, #2295, #2376, #2400)
- Added Google Gemini 1.0, Gemini 1.5, and Gemini Vision (#2186, #2189, #2561)
- Improved handling of content blocking in the Vertex AI client (#2546, #2313)
- Added Claude 3 (#2432, #2440, #2536)
- Added Mistral Small, Medium and Large (#2307, #2333, #2399)
- Added Mixtral 8x7B Instruct and 8x22B (#2416, #2562)
- Added Luminous Multimodal (#2189)
- Added LLaVA and BakLLaVA (#2234)
- Added Phi-2 (#2338)
- Added Qwen1.5 (#2338, #2369)
- Added Qwen VL and VL Chat (#2428)
- Added Amazon Titan (#2165)
- Added Google Gemma (#2397)
- Added OpenFlamingo (#2237)
- Removed logprobs from models hosted on Together (#2325)
- Added support for vLLM (#2402)
- Added DeepSeek LLM 67B Chat (#2563)
- Added Llama 3 (#2579)
- Added DBRX Instruct (#2585)
Framework
- Added support for text-to-image models (#1939)
- Refactored the `Metric` class structure (#2170, #2171, #2218)
- Fixed a bug in computing general metrics (#2172)
- Added a `--disable-cache` flag to disable caching in `helm-run` (#2143)
- Added a `--schema-path` flag to support user-provided `schema.yaml` files in `helm-summarize` (#2520)
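As a rough illustration of where the two new flags fit, the sketch below assembles `helm-run` and `helm-summarize` command lines in Python without executing them. Only `--disable-cache`, `--schema-path`, and `--run-entries` come from these notes; the `--suite` flag, the example run entry, the suite name, and the schema file name are assumptions based on typical HELM usage and may need adjusting for your setup.

```python
import shlex
from typing import List, Optional


def helm_run_args(run_entries: List[str], suite: str, disable_cache: bool = False) -> List[str]:
    """Build a helm-run argument list, optionally turning caching off."""
    # --suite is an assumption here; it is not mentioned in these release notes.
    args = ["helm-run", "--run-entries", *run_entries, "--suite", suite]
    if disable_cache:
        args.append("--disable-cache")
    return args


def helm_summarize_args(suite: str, schema_path: Optional[str] = None) -> List[str]:
    """Build a helm-summarize argument list, optionally with a custom schema."""
    args = ["helm-summarize", "--suite", suite]
    if schema_path is not None:
        args += ["--schema-path", schema_path]
    return args


# Hypothetical run entry, suite name, and schema file, for illustration only.
run_cmd = helm_run_args(
    ["mmlu:subject=philosophy,model=openai/gpt2"], suite="my-suite", disable_cache=True
)
summarize_cmd = helm_summarize_args("my-suite", schema_path="my_schema.yaml")

# Print shell-ready command lines; pass the lists to subprocess.run to execute them.
print(shlex.join(run_cmd))
print(shlex.join(summarize_cmd))
```

Building the commands as lists keeps each flag and value a separate argument, so values containing spaces or shell metacharacters are not misparsed if the lists are later handed to `subprocess.run`.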
Frontend
- Switched to the new React frontend for local development by default (#2251)
- Added support for displaying images (#2371)
- Made various improvements to project and version dropdown menus (#2272, #2401, #2458)
- Made row and column headers sticky in leaderboard tables (#2273, #2275)
Evaluation Results
- Lite v1.1.0
  - Added results for Phi-2 and Mistral Medium
- Lite v1.2.0
  - Added results for Llama 3, Mixtral 8x22B, OLMo, Qwen1.5, and Gemma
- HEIM v1.1.0
  - Added results for Adobe GigaGAN and DeepFloyd IF
- Instruct v1.0.0
  - Initial release with results for OpenAI GPT-4, OpenAI GPT-3.5 Turbo, Anthropic Claude v1.3, and Cohere Command beta
- MMLU v1.0.0
  - Initial release with 22 models
- MMLU v1.1.0
  - Added results for Llama 3, Mixtral 8x22B, OLMo, and Qwen1.5 (32B)
Contributors
Thank you to the following contributors for your work on this HELM release!
- @acphile
- @akashc1
- @AlphaPav
- @andyzorigin
- @boxin-wbx
- @brianwgoldman
- @chenweixin107
- @danielz02
- @elronbandel
- @farzaank
- @garyxcj
- @ImKeTT
- @JosselinSomervilleRoberts
- @kangmintong
- @michiyasunaga
- @mmonfort
- @mtake
- @percyliang
- @polaris-73
- @pongib
- @ritik99
- @ruixin31
- @sbdzdz
- @shenmishajing
- @teetone
- @tybrs
- @YianZhang
- @yifanmai
- @yoavkatz