Releases: open-compass/opencompass
0.3.6
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.6!
🌟 Highlights
✨ This release brings several updates and new features that enhance the functionality and performance of OpenCompass. Notable additions include support for long-context evaluation of base models, the introduction of the BABILong dataset, and the addition of MuSR dataset evaluation. We have also welcomed new contributors to our community, whom we are excited to introduce.
🚀 New Features
🔥 Added long context evaluation for base models, expanding the scope of model assessments.
🔥 Introduced the BABILong dataset, enriching the resources available for research and development.
🔥 Added MuSR dataset evaluation, which assesses language models on multistep soft reasoning tasks.
📖 Documentation
📚 Updated documentation to reflect the latest changes and features, ensuring that users can easily integrate these updates into their workflows.
🐛 Bug Fixes
🛠 Fixed issues with `first_option_postprocess` to improve reliability.
🛠 Addressed bugs in the PR testing process to ensure smoother contributions from the community.
⚙ Enhancements and Refactors
🔧 Implemented auto-download for FollowBench, streamlining the setup process for new users.
🔧 Refined the CI/CD pipeline, including daily tests and baseline updates, to maintain high standards of quality and performance.
🎉 Welcome New Contributors
👏 We are delighted to welcome three new contributors who have made valuable contributions to this release:
- @MCplayerFromPRC for addressing InternTrain evaluation differences.
- @DespairL for adding single LoRA adapter support for vLLM inference.
- @abrohamLee for contributing MuSR Dataset Evaluation.
We hope you enjoy this new release and find it useful for your projects. Your feedback is always welcome and helps us improve OpenCompass continuously. Thank you for being part of our community! 🌟
Full Changelog: 0.3.5...0.3.6
0.3.5
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.5!
🌟 Highlights
- 🚀 Introduction of two new datasets: CMO&AIME, expanding our evaluation capabilities.
- 📖 Several updates to our documentation, ensuring clearer guidance for all users.
- ⚙ Several enhancements and refactoring efforts to make our codebase more robust and maintainable.
🚀 New Features
- 🆕 Added support for the CMO&AIME datasets, broadening the range of tasks we can evaluate models on. (#1610)
- 🆕 Introduced the `CompassArenaSubjectiveBench`, a new benchmark for subjective evaluations. (#1645)
- 🆕 Added configurations for the lmdeploy DeepSeek model, enhancing compatibility with cutting-edge technologies. (#1656)
📖 Documentation
- 📚 Updated the documentation to reflect the latest changes and improvements, making it easier than ever to navigate and understand. (#1655)
🐛 Bug Fixes
- 🔧 Fixed issues with the `ruler_16k_gen` component, ensuring more accurate and reliable results. (#1643)
- 🔧 Resolved an error in the `get_loglikelihood` function when using lmdeploy as the accelerator. (#1659)
- 🔧 Addressed problems with automatic downloads for certain datasets, streamlining the user experience. (#1652)
⚙ Enhancements and Refactors
- 💪 Enhanced the summarizer configurations for models, improving the efficiency and effectiveness of summarization tasks. (#1600)
- 💪 Added new model configurations, keeping up with the latest advancements in machine learning. (#1653)
- 💪 Updated the WildBench maximum sequence length, allowing for better handling of longer input sequences. (#1648)
- 💪 Updated the Needlebench OSS path, ensuring smoother data access and processing. (#1651)
- 💪 Improved the `mmmlu_lite` dataloader, optimizing data loading processes. (#1658)
🎉 Welcome New Contributors
- 👏 A warm welcome to @jnanliu, who has made their first contribution by adding the CMO&AIME datasets! (#1610)
For a complete overview of all changes, please refer to the full changelog: 0.3.4...0.3.5
0.3.4
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.4!
🎉 OpenCompass v0.3.4 brings major enhancements including new benchmarks, improved documentation, and numerous bug fixes.
🌈 Notable features include support for new datasets and the integration of lmdeploy pipeline API.
🔧 Support for New Datasets:
- Addition of GaoKaoMath Dataset for Evaluation.
- Support for MMMLU & MMMLU-lite Benchmark.
- Integration of Judgerbench and reorganization of subeval.
- Support for LiveCodeBench.
📝 Output Format Enhancements:
- Support for printing and saving results as markdown format tables.
🔧 Pipeline and Integration Improvements:
- Integration of lmdeploy pipeline API.
- Update of TurboMindModel through integration of lmdeploy pipeline API.
- Removal of the prefix `bos_token` from messages when using lmdeploy as the accelerator.
🛠️ Miscellaneous Enhancements:
- Updates to the common summarizer regex extraction.
- Internal humaneval postprocess addition and updates.
📖 Documentation Updates
🐛 Bug Fixes
🎉 Welcome New Contributors
👋 New Contributors Joined the Team:
@BobTsang1995 - Contributed support for MMMLU & MMMLU-lite Benchmark.
@noemotiovon - Provided NPU support fixes.
@changlan - Fixed RULER datasets.
@BIGWangYuDong - Added support for printing and saving results as markdown format tables.
Thank you to all contributors who have made this release possible. For a complete list of changes, please see the full changelog linked below.
Full Changelog: 0.3.3...0.3.4
0.3.3
🌟 OpenCompass v0.3.3 Release Log
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.3!
🚀 New Features
- 🔧 Added support for the SciCode summarizer configuration.
- 🛠 Introduced support for internal Followbench.
- 🔧 Updated models and configurations for MathBench & WikiBench under FullBench.
- 🛠 Enhanced support for OpenAI O1 models and Qwen2.5 Instruct.
- 🔧 Included a postprocess function for custom models.
- 🛠 Added InternTrain feature for broader model training scenarios.
📖 Documentation
- 📚 Updated the README with the latest information on how to use OpenCompass effectively.
🐛 Bug Fixes
- 🔧 Fixed issues with the link-check workflow and wildbench.
- 🛠 Resolved errors in partitioning and corrected typos throughout the codebase.
- 🔧 Addressed compatibility issues with lmdeploy interface type changes.
- 🛠 Fixed the followbench dataset configuration and token settings.
⚙ Enhancements and Refactors
- 🛠 Enhanced support for verbose output in OpenAI API interactions.
- 🔧 Updated maximum output length configurations for multiple models.
- 🛠 Improved handling of the "begin section" in meta_template for better parsing.
- 🔧 Added a common summarizer for qabench and expanded test coverage for various models.
🎉 Welcome New Contributors
👋 We'd like to extend a warm welcome to our new contributors who have made their first contributions to OpenCompass:
- @x54-729 introduced InternTrain.
- @chuanyangjin helped correct typos.
- @cuauty added support for reasoning from BaiLing LLM.
Thank you to all our contributors for making this release possible!
Full Changelog: 0.3.2.post1...0.3.3
0.3.2.post1
What's Changed
- [Fix]Init import fix by @MaiziXiao in #1500
- [Bump] Bump version to 0.3.2.post1 by @MaiziXiao in #1502
Full Changelog: 0.3.2...0.3.2.post1
0.3.2
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.2!
🚀 New Features
- 🛠 Added `extra_body` support for OpenAISDK and introduced proxy URL support when connecting to OpenAI's API (see the sketch after this list).
- 🗂 Included auto-download functionality for MMLU-Pro, NeedleBench, LongBench, and other datasets.
- 🤝 Integrated support for the Rendu API.
- 🧪 Added a model postprocess function.
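For orientation, here is a minimal sketch of what an API model entry using these options might look like. The wrapper name comes from this release note; the field names (`key`, `openai_api_base`, `extra_body`, and so on) are assumptions for illustration only, so check the configs shipped with your OpenCompass version before copying.

```python
# Hypothetical OpenAISDK model config using extra_body and a proxy base URL.
# Field names other than `type` and `path` are assumptions; verify against your installed version.
from opencompass.models import OpenAISDK

models = [
    dict(
        abbr='gpt-4o-via-proxy',
        type=OpenAISDK,
        path='gpt-4o',                                      # model name sent to the API
        key='ENV',                                          # assumed convention: read the key from the environment
        openai_api_base='https://my-proxy.example.com/v1',  # assumed field for the proxy URL
        extra_body={'enable_thinking': False},              # assumed to be forwarded verbatim in the request body
        max_out_len=2048,
        batch_size=8,
    ),
]
```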
📖 Documentation
- 📜 Updated the README file for better clarity and guidance.
🐛 Bug Fixes
- 🛠 Fixed CLI evaluation for multiple models.
- 🛠 Updated requirements to resolve dependency issues.
- 🛠 Corrected configurations for the Llama model series.
- 🛠 Addressed bad cases and added environment information to improve testing.
⚙ Enhancements and Refactors
- 🛠 Made OPENAI_API_BASE compatible with OpenAI's default environment settings.
- 🛠 Optimized SciCode for improved performance.
- 🛠 Added an `api_key` attribute to TurboMindAPIModel.
- 🛠 Implemented fixes and improvements to the CI test environment, including baselines for vllm.
🎉 Welcome New Contributors
- 👋 @cpa2001 contributed the addition of `icl_sliding_k_retriever.py` and updates to `__init__.py`.
- 👋 @gyin94 made the OPENAI_API_BASE compatible with OpenAI's default environment.
- 👋 @chengyingshe added an `api_key` attribute to TurboMindAPIModel.
- 👋 @yanzeyu supported the integration of the Rendu API.
Full Changelog: 0.3.1...0.3.2
OpenCompass v0.3.1
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.1!
🌟 Highlights
- 🚀 Added support for pip installation and updated the README and evaluation demo.
- 🐛 Fixed various dataset loading issues.
- ⚙️ Enhanced auto-download features for datasets.
🚀 New Features
- 🆕 Introduced support for Ruler datasets.
- 🆕 Enhanced model compatibility.
- 🆕 Improved dataset handling, with auto-download support for various datasets.
📖 Documentation
- 📚 Updated README to reflect the latest changes.
- 📚 Improved documentation for dataset loading procedures.
🐛 Bug Fixes
- 🐞 Resolved modelscope dataset load issues.
- 🐞 Corrected evaluation scores for the Lawbench dataset.
- 🐞 Fixed dataset bugs for CommonsenseQA and Longbench.
⚙ Enhancements and Refactors
- 🔧 Retained first and last halves of prompts to avoid max_seq_len issues.
- 🔧 Updated Compassbench to v1.3.
- 🔧 Switched to Python runner for single GPU operations.
🎉 Welcome New Contributors
- 🙌 @Yunnglin for fixing modelscope dataset load problem.
- 🙌 @changyeyu for addressing max_seq_len issues with prompt handling.
- 🙌 @seetimee for updates to openai_api.py.
- 🙌 @HariSeldon0 for adding the scicode dataset.
What's Changed
- [Fix] Fix modelscope dataset load problem by @Yunnglin in #1406
- [Fix] the issue where scores are negative in the Lawbench dataset evaluation(#1402) by @yaoyingyy in #1403
- [Doc] Update README by @tonysy in #1404
- Retain first and last halves of prompts to avoid max_seq_len issues by @changyeyu in #1373
- [UPDATE] Compassbench v1.3 by @MaiziXiao in #1396
- [Fix] longbench dataset load fix by @MaiziXiao in #1422
- [Fix] Sub summarizer order fix by @bittersweet1999 in #1426
- [Update] Support auto-download of FOFO/MT-Bench-101 by @tonysy in #1423
- [Bug] Commonsenseqa dataset fix by @MaiziXiao in #1425
- [Feature] Add abbr for rolebench dataset by @xu-song in #1431
- [Feature] Add Ruler datasets by @MaiziXiao in #1310
- [Fix] Fix openai api tiktoken bug for api server by @liushz in #1433
- Update openai_api.py by @seetimee in #1438
- [Feature] Add model support for 'huggingface_above_v4_33' when using '-a' by @liushz in #1430
- Add scicode by @HariSeldon0 in #1417
- [Doc] Update Readme by @MaiziXiao in #1439
- [Fix] Update option postprocess & mathbench language summarizer by @liushz in #1413
- [ci] add commond testcase into daily testcase by @zhulinJulia24 in #1447
- [Feature] Switch to python runner for single GPU by @xu-song in #1308
- [Fix] Update SciCode and Gemma model by @tonysy in #1449
- [Bump] Bump version to 0.3.1 by @tonysy in #1450
Full Changelog: 0.3.0...0.3.1
Thank you for your continued support and contributions to OpenCompass!
OpenCompass v0.3.0
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.0! This release brings a variety of new features, enhancements, and bug fixes to improve your experience.
🌟 Highlights
- Support for OpenAI ChatCompletion
- Updated Model Support List
- Support Dataset Automatic Download
- Support `pip install opencompass` (see the sketch after this list)
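To give a feel for the pip-installed workflow, here is a minimal sketch of a model config written against the installed package. The `HuggingFacewithChatTemplate` wrapper is mentioned elsewhere in these notes; the exact field names below are assumptions, so treat this as an illustration rather than a verified recipe and refer to the bundled example configs.

```python
# Minimal sketch: after `pip install opencompass`, model configs are plain Python dicts
# built on wrappers importable from the installed package. Field names are illustrative.
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        abbr='internlm2-chat-7b-hf',
        type=HuggingFacewithChatTemplate,
        path='internlm/internlm2-chat-7b',  # Hugging Face repo id
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),           # assumed field for per-model resource settings
    ),
]
```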
🚀 New Features
- Support for CompassBench Checklist Evaluation (PR #1339 by @bittersweet1999)
- Adding support for Doubao API (PR #1218 by @LeavittLang)
- Support for ModelScope Datasets (PR #1289 by @wangxingjun778)
📖 Documentation
🐛 Bug Fixes
- Fix Typing and Typo
- Fix Lint Issues (PR #1334 by @DseidLi)
- Fix Summary Error in subjective.py
⚙ Enhancements and Refactors
- Upgrade Default Math `pred_postprocessor`
- Fix Path and Folder Updates
- Update Get Data Path for LCBench and HumanEval
🔗 Full Change Logs
- [Fix] Change abbr for arenahard dataset by @bittersweet1999 in #1302
- [Fix] Force register by @Leymore in #1311
- [Fix] add bc for alignbench summarizer by @bittersweet1999 in #1306
- [Fix] update Faq by @bittersweet1999 in #1313
- [Fix] Fix rouge evaluator of rolebench_zh by @xu-song in #1322
- [Doc] Update NeedleBench Docs by @DseidLi in #1330
- [Fix] Fix typing and typo by @xu-song in #1331
- [Fix] Fix lint by @DseidLi in #1334
- [Feature] support compassbench Checklist evaluation by @bittersweet1999 in #1339
- Add compassbench wiki&math part by @liushz in #1342
- Compassbench v1_3 subjective evaluation by @MaiziXiao in #1341
- [Fix] Update path and folder by @tonysy in #1344
- Upgrade default math `pred_postprocessor` by @xu-song in #1340
- commit inference ppl datasets by @Quehry in #1315
- CompassBench subjective summarizer added by @MaiziXiao in #1349
- Fix MathBench Generation Config by @liushz in #1351
- [Update] Update model support list by @bittersweet1999 in #1353
- [Update] update Subeval demo config by @bittersweet1999 in #1358
- [Fix] Fix the summary error in subjective.py by @WenjinW in #1363
- [Fix] Support HF models deployed with an OpenAI-compatible API. by @heya5 in #1352
- update docs by @Leymore in #1318
- [Feature] Make NeedleBench available on HF by @DseidLi in #1364
- 【bug fix】: Remove extra ampersands. by @baymax591 in #1365
- [Fix] minor update wildbench by @kleinzcy in #1335
- Adding support for Doubao API by @LeavittLang in #1218
- [Fix] origin_prompt should be None in llm-compression task by @mqy004 in #1225
- Calm dataset by @pengbo807 in #1287
- Add `en` and `zh` groups to longbench summarizer; Fix longbench overall score by @xu-song in #1216
- [Revert] "Calm dataset (#1287)" by @bittersweet1999 in #1366
- Charm by @jxd0712 in #1230
- Support ModelScope datasets by @wangxingjun778 in #1289
- [Feature] Update pip install by @tonysy in #1324
- add support for hf_pulse_7b by @QXY716 in #1255
- [Fix] Update get_data_path for LCBench and HumanEval by @tonysy in #1375
- [Bug] Fix bug in turbomind by @tonysy in #1377
- [Fix] Fix version mismatch of CIBench by @kleinzcy in #1380
- [Fix] Fix InternLM2.5-7B-Chat-1M config by @DseidLi in #1383
- [Feature] Support import configs/models/summarizers from whl by @tonysy in #1376
- Calm dataset by @pengbo807 in #1385
- [Feature] Support OpenAI ChatCompletion by @tonysy in #1389
- [Fix] Fix slurm env by @tonysy in #1392
- [Fix] Fix CaLM import by @tonysy in #1395
- [Bump] Bump version for v0.3.0 by @tonysy in #1398
🎉 Welcome New Contributors
- @MaiziXiao made their first contribution in #1341
- @Quehry made their first contribution in #1315
- @WenjinW made their first contribution in #1363
- @heya5 made their first contribution in #1352
- @LeavittLang made their first contribution in #1218
- @pengbo807 made their first contribution in #1287
- @wangxingjun778 made their first contribution in #1289
- @QXY716 made their first contribution in #1255
Full Changelog: 0.2.6...0.3.0
OpenCompass v0.2.6
The OpenCompass team is thrilled to announce the release of OpenCompass v0.2.6!
🌟 Highlights
- No noteworthy highlights.
🚀 New Features
📖 Documentation
🐛 Bug Fixes
- #1221 Resolve release version installation and import issues
- #1228 Fix pip version issues
- #1282 Update MathBench summarizer & fix cot setting
⚙ Enhancements and Refactors
- #1284 Reorganize subjective eval
🎉 Welcome New Contributors
- @mqy004, @sefira, @Zor-X-L and @baymax591 made their first contributions. Welcome to the OpenCompass community!
🔗 Full Change Logs
- [Fix] fix summarizer by @bittersweet1999 in #1217
- Fix the issue where opencompass.cli.main cannot be imported after installing the release version by @mqy004 in #1221
- MT-Bench-101 by @sefira in #1215
- [Feature] add dataset Fofo by @bittersweet1999 in #1224
- [Fix] fix pip version by @bittersweet1999 in #1228
- add ",<2.0.0" to "numpy>=1.23.4" in requirements/runtime.txt, as pand… by @Zor-X-L in #1267
- Support wildbench by @kleinzcy in #1266
- Add doc for accelerator function by @liushz in #1252
- flash attn installation in daily testcase by @zhulinJulia24 in #1272
- Update mtbench101.py by @sefira in #1276
- [Sync] Sync with internal codes 2024.06.28 by @Leymore in #1279
- Update MathBench summarizer & fix cot setting by @liushz in #1282
- NPU adaptation by @baymax591 in #1250
- [ci] update daily testcase by @zhulinJulia24 in #1285
- [Feature] Add InternLM2.5 by @tonysy in #1286
- [Feat] Update owners for issues by @tonysy in #1293
- [Refactor] Reorganize subjective eval by @bittersweet1999 in #1284
- [Doc] quick start swap tabs by @Leymore in #1263
Full Changelog: 0.2.5...0.2.6
OpenCompass v0.2.5
The OpenCompass team is thrilled to announce the release of OpenCompass v0.2.5!
🌟 Highlights
- Simplify the huggingface / vllm / lmdeploy model wrappers: `meta_template` no longer needs to be hand-crafted in model configs (see the sketch after this list).
- Introduce evaluation results README in ~20 dataset config folders.
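For context, this is the kind of hand-written `meta_template` that the simplified wrappers make unnecessary. The structure below is an illustrative sketch of the old style, not a config copied from the repo; the new wrappers derive the chat markup for you (e.g. via the tokenizer's `apply_chat_template`, added in #1098 below).

```python
# Illustrative old-style meta_template describing chat markup by hand.
# With the simplified huggingface / vllm / lmdeploy wrappers, model configs can omit this
# and let the wrapper infer the markup (e.g. from the tokenizer's chat template).
meta_template = dict(
    round=[
        dict(role='HUMAN', begin='<|im_start|>user\n', end='<|im_end|>\n'),
        dict(role='BOT', begin='<|im_start|>assistant\n', end='<|im_end|>\n', generate=True),
    ],
)
```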
🚀 New Features
- #1065 Add LLaMA-3 Series Configs
- #1048 Add TheoremQA with 5-shot
- #1094 Support Math evaluation via judgemodel
- #1080 Add gpqa prompt from simple_evals, openai
- #1074 Add mmlu prompt from simple_evals, openai
- #1123 Add Qwen1.5 MoE 7b and Mixtral 8x22b model configs
📖 Documentation
- #1053 Update readme
- #1102 Update NeedleInAHaystack Docs
- #1110 Update README.md
- #1205 Remove --no-batch-padding and Use --hf-num-gpus
🐛 Bug Fixes
- #1036 Update setup.py install_requires
- #1051 Fixed the issue caused by the repeated loading of the vLLM model
- #1043 fix multiround
- #1070 Fix sequential runner
- #1079 Fix Llama-3 meta template
⚙ Enhancements and Refactors
- #1163 enable HuggingFacewithChatTemplate with --accelerator via cli
- #1104 fix prompt template
- #1109 Update performance of common benchmarks
🎉 Welcome New Contributors
- @liuwei130, @IcyFeather233, @VVVenus1212, @binary-husky, @dmitrysarov, @eltociear, @acylam, @lfy79001, @JuhaoLiang1997, @yaoyingyy, and @jxd0712 made their first contributions. Welcome to the OpenCompass community!
🔗 Full Change Logs
- [Fix] Update setup.py install_requires by @Leymore in #1036
- add ChemBench by @liuwei130 in #1032
- [Fix] logger.error -> logger.debug in OpenAI by @Leymore in #1050
- [Sync] Bump version to 0.2.4 by @Leymore in #1052
- [Doc] Update readme by @tonysy in #1053
- [fix]Fixed the issue caused by the repeated loading of VLLM model dur… by @IcyFeather233 in #1051
- [Sync] Sync with internal code 2024.04.19 by @Leymore in #1064
- [Fix] fix multiround by @bittersweet1999 in #1043
- [Feature] Add LLaMA-3 Series Configs by @Leymore in #1065
- [Feature] Add TheoremQA with 5-shot by @Leymore in #1048
- [Fix] Fix sequential runner by @Leymore in #1070
- Add lmdeploy tis python backend model by @ispobock in #1014
- Fix Llama-3 meta template by @liushz in #1079
- Add humaneval prompt from simple_evals, openai by @jingmingzhuo in #1076
- [Feature] Support Math evaluation via judgemodel by @bittersweet1999 in #1094
- [Feature] support arenahard evaluation by @bittersweet1999 in #1096
- Update CIBench by @kleinzcy in #1089
- [Feature] Add gpqa prompt from simple_evals, openai by @Francis-llgg in #1080
- [Deperecate] Remove multi-modal related stuff by @kennymckormick in #1072
- add vllm get_ppl by @VVVenus1212 in #1003
- fix: python path bug by @binary-husky in #1063
- fix output typing, change mutable list to immutable tuple by @dmitrysarov in #989
- [Doc] Update NeedleInAHaystack Docs by @DseidLi in #1102
- [Feature] add support for Flames datasets by @Yggdrasill7D6 in #1093
- adapt to lmdeploy v0.4.0 by @lvhan028 in #1073
- [Fix] fix prompt template by @bittersweet1999 in #1104
- [Fix] Fix Math Evaluation with Judge Model Evaluator & Add README by @liushz in #1103
- [Update] Update performance of common benchmarks by @tonysy in #1109
- [Fix] fix cmb dataset by @bittersweet1999 in #1106
- [Docs] Update README.md by @eltociear in #1110
- [Feature] Adding support for LLM Compression Evaluation by @acylam in #1108
- [Fix] remove redundant pre-commit check by @Leymore in #891
- fix LightllmApi workers bug by @helloyongyang in #1113
- [Feature] Add mmlu prompt from simple_evals, openai by @Leymore in #1074
- [Feature] update drop dataset from openai simple eval by @kleinzcy in #1092
- add mgsm datasets by @Yggdrasill7D6 in #1081
- [Fix] Fix AGIEval chinese sets by @xu-song in #972
- S3Eval Dataset by @lfy79001 in #916
- [Feature] Add AceGPT-MMLUArabic benchmark by @JuhaoLiang1997 in #1099
- [Fix] fix links by @bittersweet1999 in #1120
- [Fix] Fix NeedleBench Summarizer Typo by @DseidLi in #1125
- [Feature] Add Qwen1.5 MoE 7b and Mixtral 8x22b model configs by @acylam in #1123
- [Sync] Update accelerator by @Leymore in #1122
- [Fix] fix alpacaeval while add caching path by @bittersweet1999 in #1139
- [Fix] fix multiround by @bittersweet1999 in #1146
- [Fix] Fix Needlebench Summarizer by @DseidLi in #1143
- [Feature] Add huggingface apply_chat_template by @Leymore in #1098
- [Feat] Support dataset_suffix check for mixed configs by @xu-song in #973
- [Format] Add some config lints by @Leymore in #892
- [Sync] Sync with internal codes 2024.05.14 by @Leymore in #1156
- [Fix] fix arenahard summarizer by @bittersweet1999 in #1154
- [Fix] use ProcessPoolExecutor during mbpp eval by @Leymore in #1159
- [Fix] Update stop_words in huggingface_above_v4_33 by @Leymore in #1160
- Update accelerator by @liushz in #1152
- [Feat] enable HuggingFacewithChatTemplate with --accelerator via cli by @Leymore in #1163
- update test workflow by @zhulinJulia24 in #1167
- [Sync] Sync with internal codes 2024.05.17 by @Leymore in #1171
- add dependency in daily test workflow by @zhulinJulia24 in #1173
- [Sync] Sync with internal codes 2024.05.21.1 by @Leymore in #1175
- Update MathBench by @liushz in #1176
- [Fix] fix template by @bittersweet1999 in #1178
- Fix a bug in drop_gen.py by @kleinzcy in #1191
- [Fix] temporary files using tempfile by @yaoyingyy in #1186
- [Fix] add support for lmdeploy api judge by @bittersweet1999 in #1193
- [Fix] fix length by @bittersweet1999 in #1180
- support CHARM (https://github.com/opendatalab/CHARM) reasoning tasks by @jxd0712 in #1190
- [Feat] Update charm summary by @Leymore in #1194
- Update accelerator by @liushz in #1195
- [Sync] S...