BigCodeBench

Latest version: v0.2.5

0.2.4

What's Changed
* Fix `make_raw_chat_prompt` when prefill is disabled by zhangchen-xu in https://github.com/bigcode-project/bigcodebench/pull/75
* Specify a unique cache directory before each code execution by shwinshaker in https://github.com/bigcode-project/bigcodebench/pull/77
* Fix E2B execution debug by terryyz in https://github.com/bigcode-project/bigcodebench/pull/79
* Fix E2B by terryyz in https://github.com/bigcode-project/bigcodebench/pull/80
* Add support for Hugging Face Serverless Inference by hvaara in https://github.com/bigcode-project/bigcodebench/pull/85
* Reintroduce progress checker from #48 by hvaara in https://github.com/bigcode-project/bigcodebench/pull/86
* Fixes for tasks 211 and 215 by hvaara in https://github.com/bigcode-project/bigcodebench/pull/49
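
The "unique cache directory before each code execution" entry above can be illustrated with a minimal sketch. This is not the code from the pull request: the function name `run_with_unique_cache` and the choice of environment variables (`XDG_CACHE_HOME`, `MPLCONFIGDIR`) are assumptions made for illustration. The idea is only that each run gets a fresh directory, so cached files from one execution cannot leak into the next.

```python
import os
import subprocess
import sys
import tempfile


def run_with_unique_cache(code: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter with its own throwaway cache directory.

    Hypothetical helper for illustration; not part of bigcodebench.
    """
    with tempfile.TemporaryDirectory(prefix="exec_cache_") as cache_dir:
        env = dict(os.environ)
        # Assumed variables: point common cache locations at the fresh directory.
        env["XDG_CACHE_HOME"] = cache_dir
        env["MPLCONFIGDIR"] = cache_dir
        # The directory is deleted when the `with` block exits, after the child finishes.
        return subprocess.run(
            [sys.executable, "-c", code],
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
```

Two consecutive calls see different cache paths, and neither path survives the call that created it.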

New Contributors
* zhangchen-xu made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/75
* shwinshaker made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/77

**Full Changelog**: https://github.com/bigcode-project/bigcodebench/compare/v0.2.3...v0.2.4

0.2.3.post1

What's Changed
- Fix Docker image and its dependencies
- Support more models with reasoning effort
- Optional chat prefilling
- E2B, Gradio, and Local code execution

Evaluated LLMs (173 models)
- o3-mini
- DeepSeek R1

**Full Changelog**: https://github.com/bigcode-project/bigcodebench/compare/v0.2.1.post7...v0.2.3.post1

0.2.1.post7

What's Changed
- Fix Docker image and its dependencies
- Fix o1 concurrent generation output collection
- Update the code sanitization

Evaluated LLMs (157 models)
- o1-2024-12-17
- Gemini-2.0 series

**Full Changelog**: https://github.com/bigcode-project/bigcodebench/compare/v0.2.1.post3...v0.2.1.post7

0.2.1.post2

What's Changed
- Fix `calibration` setting in the code evaluation.
- Add `--no_execute` argument for code evaluation.
- Support concurrent API inference for `o1` and `deepseek-chat`.
- Fix API inference for Google Gemini.
- Add `--instruction_prefix` and `--response_prefix` arguments for code generation.
- Change `--id_range` input type.
- Add `--revision` argument for code generation.
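
The release notes do not say how concurrent API inference is implemented. As a generic sketch of the pattern, assuming a thread pool and a placeholder request function (`generate_concurrently` and `fake_request` are hypothetical names, not bigcodebench APIs), requests can be issued in parallel while the outputs stay aligned with their prompts:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_concurrently(
    prompts: List[str],
    request_fn: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Send one request per prompt in parallel; results keep the prompts' order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of completion order.
        return list(pool.map(request_fn, prompts))


def fake_request(prompt: str) -> str:
    # Placeholder for a real API call; returns a deterministic string.
    return f"completion for: {prompt}"
```

Collecting through `Executor.map` rather than in completion order is one simple way to keep each output matched to the task that produced it.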

Evaluated LLMs (144 models)
- Qwen2.5-Coder-32B-Instruct
- grok-beta
- claude-3-5-haiku-20241022

**Full Changelog**: https://github.com/bigcode-project/bigcodebench/compare/v0.2.0...v0.2.1.post2

0.2.0.post3

**Full Changelog**: https://github.com/bigcode-project/bigcodebench/compare/v0.1.9...v0.2.0.post3

0.1.9

**Full Changelog**: https://github.com/bigcode-project/bigcodebench/compare/v0.1.8...v0.1.9
