New Features
- Added support for evaluating the reasoning efficiency of models. Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).
- Added three benchmarks related to model reasoning: [AIME25](https://www.modelscope.cn/datasets/TIGER-Lab/AIME25), [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR), and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary); see the first sketch after this list for how they might be selected.
- Added support for stream mode during evaluation, a configurable request timeout, and local evaluation on MPS devices; see the second sketch after this list.
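
A minimal sketch of selecting the new benchmarks in a task config, assuming the registered dataset names are `aime25`, `musr`, and `process_bench` (illustrative; check the dataset registry of your installed evalscope version, and note the model id is only an example):

```python
# Hedged sketch: running the newly added benchmarks with evalscope.
# The dataset identifiers 'aime25', 'musr', and 'process_bench' are assumed
# registry names; verify them against your installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',              # example model id
    datasets=['aime25', 'musr', 'process_bench'],  # assumed benchmark names
    limit=5,                                       # small sample count for a smoke test
)

run_task(task_cfg=task_cfg)
```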
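
A similar sketch for the new runtime options. The `eval_type='service'`, `stream`, `timeout`, and `device` fields below are assumptions inferred from these release notes, not a verified API reference; consult the evalscope documentation for the exact parameter names in your version:

```python
# Hedged sketch: stream mode, request timeout, and MPS-device evaluation.
from evalscope import TaskConfig, run_task

# API (service) evaluation against an OpenAI-compatible endpoint,
# with streaming responses and a per-request timeout (assumed fields).
api_cfg = TaskConfig(
    model='qwen2.5-7b-instruct',
    eval_type='service',                                  # assumed value for API-based eval
    api_url='http://127.0.0.1:8801/v1/chat/completions',  # illustrative endpoint
    api_key='EMPTY',
    datasets=['aime25'],
    stream=True,    # assumed flag enabling stream mode
    timeout=600,    # assumed per-request timeout, in seconds
)
run_task(task_cfg=api_cfg)

# Local checkpoint evaluation on an Apple-silicon MPS device (assumed field).
local_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['musr'],
    device='mps',   # assumed device string for Apple GPUs
)
run_task(task_cfg=local_cfg)
```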
What's Changed
* update doc by Yunnglin in https://github.com/modelscope/evalscope/pull/308
* Update/docs by Yunnglin in https://github.com/modelscope/evalscope/pull/309
* Update doc and limit plotly version by Yunnglin in https://github.com/modelscope/evalscope/pull/312
* add AIME 2025 by Yunnglin in https://github.com/modelscope/evalscope/pull/313
* add perf top_k by Yunnglin in https://github.com/modelscope/evalscope/pull/317
* fix TPOP, report name conflict, query template by Yunnglin in https://github.com/modelscope/evalscope/pull/321
* fix 323 by Yunnglin in https://github.com/modelscope/evalscope/pull/325
* compat device by Yunnglin in https://github.com/modelscope/evalscope/pull/329
* use openai package by Yunnglin in https://github.com/modelscope/evalscope/pull/326
* Fix warning perf usage by Yunnglin in https://github.com/modelscope/evalscope/pull/331
* Add benchmark: `musr` and `process bench` by Yunnglin in https://github.com/modelscope/evalscope/pull/324
* add max token for connection by Yunnglin in https://github.com/modelscope/evalscope/pull/335
* fix stream by Yunnglin in https://github.com/modelscope/evalscope/pull/337
* Add think eval by Yunnglin in https://github.com/modelscope/evalscope/pull/316
* Fix device map by Yunnglin in https://github.com/modelscope/evalscope/pull/342
**Full Changelog**: https://github.com/modelscope/evalscope/compare/v0.11.0...v0.12.0