EvalScope

Latest version: v0.13.0

0.13.0

New Features
- Support for LLM-as-a-Judge evaluation, using large language models for scoring. Refer to [relevant parameters](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html#judge)
- Added support for three new evaluation benchmarks: SimpleQA, Chinese SimpleQA, and LiveCodeBench. The first two require specifying a judge model for evaluation. See [usage examples](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#id9)
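
As a rough illustration of the judge setup described above, the following is a minimal sketch using EvalScope's `run_task`/`TaskConfig` entry point from the basic-usage guide. The `judge_model_args` block, its keys, and the `simple_qa` dataset key are assumptions here; check the linked parameters and usage pages for the exact schema.

```python
from evalscope import TaskConfig, run_task

# Sketch: scoring the new SimpleQA benchmark with an LLM judge.
# Field names below that are not mentioned in the release notes are assumptions.
task_cfg = TaskConfig(
    model='qwen2.5-7b-instruct',        # hypothetical model under test
    datasets=['simple_qa'],             # assumed dataset key for SimpleQA
    limit=10,                           # evaluate only a few samples for a quick check
    judge_model_args={                  # judge model used for scoring (assumed structure)
        'model_id': 'qwen2.5-72b-instruct',
        'api_url': 'https://example.com/v1/chat/completions',
        'api_key': 'YOUR_API_KEY',
    },
)

run_task(task_cfg=task_cfg)
```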

What's Changed
* Add judge model, support simple qa and chinese simple qa by Yunnglin in https://github.com/modelscope/evalscope/pull/383
* Support general judge by Yunnglin in https://github.com/modelscope/evalscope/pull/385
* Add livecodebench by Yunnglin in https://github.com/modelscope/evalscope/pull/386


**Full Changelog**: https://github.com/modelscope/evalscope/compare/v0.12.1...v0.13.0

0.12.1

New Features
1. Multiple-choice evaluation sets now support specifying `generation` or `logits` mode (see the configuration sketch after this list).
2. Support for post-processing filters on model output, currently including:
- `remove_until {string}`: Filters out the part of the model output before the specified string.
- `extract {regex}`: Extracts the portion of the model output that matches the specified regular expression.
3. Support for the `reasoning_content` field returned by model services.
4. New support for the SuperGPQA evaluation benchmark; specify `super_gpqa` to use it.
5. Added a best-practice guide, [Evaluating the QwQ-32B and DeepSeek-R1 Models](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html), including tests of model reasoning capability and thinking efficiency.
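
A minimal sketch of how the `generation`/`logits` switch and the new output filters might be combined for the SuperGPQA benchmark. The per-dataset keys used inside `dataset_args` (`output_type`, `filters`) are assumptions; only `super_gpqa`, `generation`, `logits`, and `remove_until` are named in this release.

```python
from evalscope import TaskConfig, run_task

# Sketch: SuperGPQA in `generation` mode with a filter that drops everything
# before a reasoning model's closing think tag. Keys marked as assumed are not
# confirmed by the release notes; consult the parameters documentation.
task_cfg = TaskConfig(
    model='deepseek-r1-distill-qwen-7b',   # hypothetical reasoning model
    datasets=['super_gpqa'],               # benchmark key named in this release
    dataset_args={
        'super_gpqa': {
            'output_type': 'generation',              # or 'logits' for likelihood scoring (assumed key)
            'filters': {'remove_until': '</think>'},  # strip the reasoning prefix before answer matching
        }
    },
)

run_task(task_cfg=task_cfg)
```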

What's Changed
* update doc by Yunnglin in https://github.com/modelscope/evalscope/pull/343
* fix stream finish reason by Yunnglin in https://github.com/modelscope/evalscope/pull/347
* Update eval think by Yunnglin in https://github.com/modelscope/evalscope/pull/348
* update IQ/EQ doc by Yunnglin in https://github.com/modelscope/evalscope/pull/354
* support `generation` and `logits` output for benchmarks by Yunnglin in https://github.com/modelscope/evalscope/pull/358
* fix/set_use_cache_remove_reviews_dir_bug by x22x22 in https://github.com/modelscope/evalscope/pull/359
* add super gpqa by Yunnglin in https://github.com/modelscope/evalscope/pull/361
* Update download datasets doc by Yunnglin in https://github.com/modelscope/evalscope/pull/363
* Compat reasoning model and support filter by Yunnglin in https://github.com/modelscope/evalscope/pull/370
* fix typo in README_zh by yabea in https://github.com/modelscope/evalscope/pull/372
* Update arguments.py by xuhanxiao0624 in https://github.com/modelscope/evalscope/pull/364
* add qwq eval doc by Yunnglin in https://github.com/modelscope/evalscope/pull/376
* fix model path and encoding error by Yunnglin in https://github.com/modelscope/evalscope/pull/380
* fix tool bench by Yunnglin in https://github.com/modelscope/evalscope/pull/379

New Contributors
* x22x22 made their first contribution in https://github.com/modelscope/evalscope/pull/359
* yabea made their first contribution in https://github.com/modelscope/evalscope/pull/372
* xuhanxiao0624 made their first contribution in https://github.com/modelscope/evalscope/pull/364

**Full Changelog**: https://github.com/modelscope/evalscope/compare/v0.12.0...v0.12.1

0.12.0

New Features
- Added support for evaluating the reasoning efficiency of models. Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).
- Added support for three reasoning-related evaluation benchmarks: [AIME25](https://www.modelscope.cn/datasets/TIGER-Lab/AIME25), [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR), and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary).
- Support for stream mode during evaluation, for specifying a request timeout, and for local evaluation on `mps` devices (a sketch follows this list).
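
A minimal sketch of evaluating an OpenAI-compatible model service on AIME25 with streaming and a per-request timeout. The `eval_type`, `api_url`, `api_key`, `stream`, and `timeout` field names are assumptions, as is the `aime25` dataset key; verify them against the parameters documentation.

```python
from evalscope import TaskConfig, run_task

# Sketch: API-based evaluation with streamed responses and a longer timeout,
# which is useful for slow reasoning models. Assumed field names are noted.
task_cfg = TaskConfig(
    model='my-served-model',                               # hypothetical deployed model name
    eval_type='service',                                   # assumed switch for API-based evaluation
    api_url='http://127.0.0.1:8000/v1/chat/completions',   # local inference endpoint
    api_key='EMPTY',
    datasets=['aime25'],                                   # assumed key for the AIME25 benchmark
    stream=True,                                           # request streamed completions
    timeout=600,                                           # per-request timeout (seconds, assumed unit)
)

run_task(task_cfg=task_cfg)
```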

What's Changed
* update doc by Yunnglin in https://github.com/modelscope/evalscope/pull/308
* Update/docs by Yunnglin in https://github.com/modelscope/evalscope/pull/309
* Update doc and limit plotly version by Yunnglin in https://github.com/modelscope/evalscope/pull/312
* add AIME 2025 by Yunnglin in https://github.com/modelscope/evalscope/pull/313
* add perf top_k by Yunnglin in https://github.com/modelscope/evalscope/pull/317
* fix TPOT, report name conflict, query template by Yunnglin in https://github.com/modelscope/evalscope/pull/321
* fix 323 by Yunnglin in https://github.com/modelscope/evalscope/pull/325
* compat device by Yunnglin in https://github.com/modelscope/evalscope/pull/329
* use openai package by Yunnglin in https://github.com/modelscope/evalscope/pull/326
* Fix warning perf usage by Yunnglin in https://github.com/modelscope/evalscope/pull/331
* Add benchmark: `musr` and `process bench` by Yunnglin in https://github.com/modelscope/evalscope/pull/324
* add max token for connection by Yunnglin in https://github.com/modelscope/evalscope/pull/335
* fix stream by Yunnglin in https://github.com/modelscope/evalscope/pull/337
* Add think eval by Yunnglin in https://github.com/modelscope/evalscope/pull/316
* Fix device map by Yunnglin in https://github.com/modelscope/evalscope/pull/342


**Full Changelog**: https://github.com/modelscope/evalscope/compare/v0.11.0...v0.12.0

0.11.0

New Features
1. Support for evaluating the mathematical reasoning capabilities of DeepSeek-R1-style models. For details, see [Best Practices](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/deepseek_r1_distill.html).
2. Support for the `eval_batch_size` parameter to accelerate model evaluation.
3. Support for setting `prompt_template`, `system_prompt`, and `metrics_list` parameters for the test set.
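
A minimal sketch combining `eval_batch_size` with the new per-dataset `prompt_template`, `system_prompt`, and `metrics_list` settings. Whether these three fields live at the top level or nest under `dataset_args`, the `{query}` placeholder, and the metric name are all assumptions; check the parameters documentation.

```python
from evalscope import TaskConfig, run_task

# Sketch: batched GSM8K evaluation with a custom prompt template, system prompt,
# and metric list. Nesting and placeholder conventions are assumptions.
task_cfg = TaskConfig(
    model='deepseek-r1-distill-qwen-1.5b',   # hypothetical distilled model
    datasets=['gsm8k'],
    eval_batch_size=8,                       # run 8 requests in parallel to speed up evaluation
    dataset_args={
        'gsm8k': {
            'prompt_template': 'Solve the problem step by step.\n{query}',  # placeholder name is illustrative
            'system_prompt': 'You are a careful math assistant.',
            'metrics_list': ['AverageAccuracy'],                            # illustrative metric name
        }
    },
)

run_task(task_cfg=task_cfg)
```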

What's Changed
* set default stop to list by Yunnglin in https://github.com/modelscope/evalscope/pull/296
* update datasets version by wangxingjun778 in https://github.com/modelscope/evalscope/pull/297
* Add ds distill collection by Yunnglin in https://github.com/modelscope/evalscope/pull/298
* update custom general mcq by Yunnglin in https://github.com/modelscope/evalscope/pull/299
* fix viz html label and num of sample by Yunnglin in https://github.com/modelscope/evalscope/pull/300
* support load collection from remote by Yunnglin in https://github.com/modelscope/evalscope/pull/303
* update perf doc by Yunnglin in https://github.com/modelscope/evalscope/pull/305
* support multi metrics and system prompt by Yunnglin in https://github.com/modelscope/evalscope/pull/306


**Full Changelog**: https://github.com/modelscope/evalscope/compare/v0.10.1...v0.11.0

0.10.1

What's Changed
* Add visualization examples, support interface language switching between Chinese and English by Yunnglin in https://github.com/modelscope/evalscope/pull/289, https://github.com/modelscope/evalscope/pull/294
* Add GPQA benchmark by Yunnglin in https://github.com/modelscope/evalscope/pull/293
* Fix ifeval dependency by Yunnglin in https://github.com/modelscope/evalscope/pull/292
* Fix viz subset by Yunnglin in https://github.com/modelscope/evalscope/pull/295

**Full Changelog**: https://github.com/modelscope/evalscope/compare/v0.10.0...v0.10.1

0.10.0

What's Changed
* Feat: Add EvalScope dashboard by Yunnglin in https://github.com/modelscope/evalscope/pull/277
- Including single-model evaluation results and multi-model comparison; see [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details
Others
* Add `model-id` in arguments by Yunnglin in https://github.com/modelscope/evalscope/pull/274
* Add `ifeval` and unify report format by Yunnglin in https://github.com/modelscope/evalscope/pull/275
* Add `iquiz` and use first metric by default for multi metrics by Yunnglin in https://github.com/modelscope/evalscope/pull/288
* Support specifying system prompt by Yunnglin in https://github.com/modelscope/evalscope/pull/283
* Bug-fix multi-metrics dataset by Yunnglin in https://github.com/modelscope/evalscope/pull/282
* Bug-fix mmlu read local data by Yunnglin in https://github.com/modelscope/evalscope/pull/273
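
As a rough illustration of the `ifeval` and `iquiz` benchmarks and the new `model-id` argument added in this release, here is a minimal sketch. The Python-side `model_id` spelling and the overall `TaskConfig` usage are assumptions to verify against the documentation.

```python
from evalscope import TaskConfig, run_task

# Sketch: running the newly added `ifeval` and `iquiz` benchmarks, with a
# run label corresponding to the new `model-id` argument (assumed spelling).
task_cfg = TaskConfig(
    model='qwen2.5-7b-instruct',            # hypothetical model path or served model name
    model_id='qwen2.5-7b-instruct-run1',    # label under which results appear in reports (assumed)
    datasets=['ifeval', 'iquiz'],
)

run_task(task_cfg=task_cfg)
```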

**Full Changelog**: https://github.com/modelscope/evalscope/compare/v0.9.0...v0.10.0
