BigCodeBench

Latest version: v0.2.1.post1

0.2.1.post1

What's Changed
- Fix `calibration` setting in the code evaluation.
- Add `--no_execute` argument for code evaluation.
- Support concurrent API inference for `o1` and `deepseek-chat`.
- Fix API inference for Google Gemini.
- Add `--instruction_prefix` and `--response_prefix` arguments for code generation.
- Change `--id_range` input type.
- Add `--revision` argument for code generation.

Evaluated LLMs (144 models)
- Qwen2.5-Coder-32B-Instruct
- grok-beta
- claude-3-5-haiku-20241022

**Full Changelog**: https://github.com/bigcode-project/bigcodebench/compare/v0.2.0...v0.2.1.post1

0.2.0.post3

**Full Changelog**: https://github.com/bigcode-project/bigcodebench/compare/v0.1.9...v0.2.0.post3

0.1.9

**Full Changelog**: https://github.com/bigcode-project/bigcodebench/compare/v0.1.8...v0.1.9

0.1.8

Features:
- Support `BigCodeBench-Hard` subset: https://github.com/bigcode-project/bigcodebench/pull/17
- Identify and fix tokenizer setup: https://github.com/bigcode-project/bigcodebench/issues/21
- Customize the tokenizer: https://github.com/bigcode-project/bigcodebench/pull/20
- Add the pass rate result log: https://github.com/bigcode-project/bigcodebench/pull/20

Contributors:
- marianna13: https://github.com/bigcode-project/bigcodebench/pull/20

Models:
- A total of 96 models at the time of the release

Acknowledgement:
- ethanc8
- takkyu2
- imamnurby

**Full Changelog**: https://github.com/bigcode-project/bigcodebench/compare/v0.1.7...v0.1.8

0.1.7.post2

- Improved the calculation of the ground truth pass rate and addressed the issue raised in https://github.com/bigcode-project/bigcodebench/pull/12#issuecomment-2199186199.
- Updated the README docs.

0.1.7

Fix some identified issues:
- The ground truth pass rate was previously computed incorrectly.
- RAM limits passed by the user raised errors because they were set as floats.
- User permissions were not set up correctly in the Evaluate Docker.
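
The RAM limit fix can be illustrated with a minimal sketch. It assumes the limit is supplied in GiB and must reach the operating system as a whole number of bytes; the helper name `to_byte_limit` is hypothetical and is not part of the bigcodebench code base.

```python
def to_byte_limit(limit_gib: float) -> int:
    """Convert a RAM limit in GiB (possibly fractional) to whole bytes.

    Hypothetical helper for illustration only.
    """
    if limit_gib <= 0:
        raise ValueError("RAM limit must be positive")
    # Truncate to an integer so the value is safe for interfaces
    # that only accept integer byte counts.
    return int(limit_gib * 1024 ** 3)


limit_bytes = to_byte_limit(1.5)
print(limit_bytes, type(limit_bytes).__name__)
```

Converting once, at the point where the option is parsed, keeps float values from leaking into later calls that expect integers.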

Features:
- `--check-gt-only` now prints the pass rate when it finishes.
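
As a rough sketch of what a ground truth pass rate report might compute, the example below takes the fraction of tasks whose reference solution passes its tests. The function name, the task IDs, and the result format are all assumptions made for illustration; the actual computation in bigcodebench may differ.

```python
from typing import Mapping


def pass_rate(results: Mapping[str, bool]) -> float:
    """Return the fraction of tasks marked as passed (0.0 if there are none)."""
    if not results:
        return 0.0
    # True counts as 1 and False as 0 when summed.
    return sum(results.values()) / len(results)


# Hypothetical per-task outcomes for the ground truth solutions.
gt_results = {"task_0": True, "task_1": True, "task_2": False, "task_3": True}
print(f"Ground truth pass rate: {pass_rate(gt_results):.3f}")
```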