What's Changed
- Fix the `calibration` setting in the code evaluation.
- Add the `--no_execute` argument for code evaluation.
- Support concurrent API inference for `o1` and `deepseek-chat`.
- Fix API inference for Google Gemini.
- Add the `--instruction_prefix` and `--response_prefix` arguments for code generation.
- Change the `--id_range` input type.
- Add the `--revision` argument for code generation.
- Enhanced the calculation of ground truth pass rate, and addressed the issue mentioned in https://github.com/bigcode-project/bigcodebench/pull/12#issuecomment-2199186199.
- Update the README docs.
0.1.7
Fix some identified issues:
- The ground truth pass rate was not previously computed correctly.
- RAM limits passed as arguments raised errors because they were set as floats.
- User permissions were not set up correctly in the Evaluate Docker.
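
The RAM-limit fix above can be illustrated with a minimal sketch. It assumes the limits end up in an API that expects whole numbers of bytes, such as Python's `resource.setrlimit`. The helper name `to_byte_limit` and the MiB unit are hypothetical and are not taken from the project's code.

```python
def to_byte_limit(limit_mib):
    """Convert a RAM limit in MiB (possibly fractional) to an integer byte count.

    Hypothetical helper: limit-setting APIs generally expect integers,
    so a float computed from user input is truncated explicitly here.
    """
    return int(limit_mib * 1024 * 1024)


# A fractional limit becomes a plain int before it is handed to the OS.
limit = to_byte_limit(1.5)
print(type(limit).__name__, limit)
```

Converting once at the boundary keeps the rest of the code free to do arithmetic on floats.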
Features:
- `check-gt-only` prints the pass rate when it finishes.
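
As a rough illustration of what a ground truth pass rate means, the sketch below computes the fraction of tasks whose reference solution passes. The function name, the `results` mapping, and the `"pass"` status string are assumptions made for this example and do not describe the project's actual data structures.

```python
def ground_truth_pass_rate(results):
    """Return the fraction of tasks whose ground-truth solution passed.

    `results` is a hypothetical mapping of task id -> status string;
    only the status "pass" counts as a success.
    """
    if not results:
        return 0.0  # avoid dividing by zero when nothing was evaluated
    passed = sum(1 for status in results.values() if status == "pass")
    return passed / len(results)


# Three of four tasks pass, so the rate is 0.75.
example = {"task/0": "pass", "task/1": "fail", "task/2": "pass", "task/3": "pass"}
print(f"Ground truth pass rate: {ground_truth_pass_rate(example):.2f}")
```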