New Features
- Added support for evaluating the reasoning efficiency of models. Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).
- Added three benchmarks related to model reasoning: [AIME25](https://www.modelscope.cn/datasets/TIGER-Lab/AIME25), [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR), and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary); see the first sketch after this list for how they might be selected.
- Added support for stream mode during evaluation, a configurable request timeout, and local evaluation on MPS devices; see the second sketch after this list.
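
A minimal sketch of selecting the new benchmarks in a task config, assuming the registered dataset names are `aime25`, `musr`, and `process_bench` (illustrative; check the dataset registry of your installed evalscope version, and note the model id is only an example):

```python
# Hedged sketch: running the newly added benchmarks with evalscope.
# The dataset identifiers 'aime25', 'musr', and 'process_bench' are assumed
# registry names; verify them against your installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',              # example model id
    datasets=['aime25', 'musr', 'process_bench'],  # assumed benchmark names
    limit=5,                                       # small sample count for a smoke test
)

run_task(task_cfg=task_cfg)
```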
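
A similar sketch for the new runtime options. The `eval_type='service'`, `stream`, `timeout`, and `device` fields below are assumptions inferred from these release notes, not a verified API reference; consult the evalscope documentation for the exact parameter names in your version:

```python
# Hedged sketch: stream mode, request timeout, and MPS-device evaluation.
from evalscope import TaskConfig, run_task

# API (service) evaluation against an OpenAI-compatible endpoint,
# with streaming responses and a per-request timeout (assumed fields).
api_cfg = TaskConfig(
    model='qwen2.5-7b-instruct',
    eval_type='service',                                  # assumed value for API-based eval
    api_url='http://127.0.0.1:8801/v1/chat/completions',  # illustrative endpoint
    api_key='EMPTY',
    datasets=['aime25'],
    stream=True,    # assumed flag enabling stream mode
    timeout=600,    # assumed per-request timeout, in seconds
)
run_task(task_cfg=api_cfg)

# Local checkpoint evaluation on an Apple-silicon MPS device (assumed field).
local_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['musr'],
    device='mps',   # assumed device string for Apple GPUs
)
run_task(task_cfg=local_cfg)
```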
What's Changed
* update doc by Yunnglin in https://github.com/modelscope/evalscope/pull/308
* Update/docs by Yunnglin in https://github.com/modelscope/evalscope/pull/309
* Update doc and limit plotly version by Yunnglin in https://github.com/modelscope/evalscope/pull/312
* add AIME 2025 by Yunnglin in https://github.com/modelscope/evalscope/pull/313
* add perf top_k by Yunnglin in https://github.com/modelscope/evalscope/pull/317
* fix TPOP, report name conflict, query template by Yunnglin in https://github.com/modelscope/evalscope/pull/321
* fix 323 by Yunnglin in https://github.com/modelscope/evalscope/pull/325
* compat device by Yunnglin in https://github.com/modelscope/evalscope/pull/329
* use openai package by Yunnglin in https://github.com/modelscope/evalscope/pull/326
* Fix warning perf usage by Yunnglin in https://github.com/modelscope/evalscope/pull/331
* Add benchmark: `musr` and `process bench` by Yunnglin in https://github.com/modelscope/evalscope/pull/324
* add max token for connection by Yunnglin in https://github.com/modelscope/evalscope/pull/335
* fix stream by Yunnglin in https://github.com/modelscope/evalscope/pull/337
* Add think eval by Yunnglin in https://github.com/modelscope/evalscope/pull/316
* Fix device map by Yunnglin in https://github.com/modelscope/evalscope/pull/342
**Full Changelog**: https://github.com/modelscope/evalscope/compare/v0.11.0...v0.12.0