## What is this?
In addition to the initial version of the EvalPlus source code, we release the LLM-generated code samples on HumanEval+ (also applicable to the base HumanEval) together with regularized ground-truth solutions. With these we hope to accelerate future research: researchers can reuse our pre-generated samples instead of generating them from scratch.
- `${MODEL_NAME}_temp_${TEMPERATURE}.zip`: LLM-produced program samples
- `HumanEvalPlusGT.zip`: The re-implemented ground-truth solutions
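For convenience, here is a minimal sketch of how the released archives can be inspected with Python's standard library. The concrete file name and the internal layout printed here are assumptions for illustration, not a specification of the release format:

```python
import zipfile

# Hypothetical file names: substitute the archives downloaded from this release.
archives = [
    "codegen-16B_temp_0.8.zip",  # ${MODEL_NAME}_temp_${TEMPERATURE}.zip
    "HumanEvalPlusGT.zip",       # re-implemented ground-truth solutions
]

# Peek at the archive layout without extracting everything.
for path in archives:
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        print(f"{path}: {len(names)} entries, e.g. {names[:3]}")
```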
## Data sources
The configuration used to generate these code samples follows our preprint: https://arxiv.org/abs/2305.01210
- The code samples cover:
  - **14 models** (10 model types)
  - **5 temperature settings**: 0 (greedy decoding) as well as `{0.2, 0.4, 0.6, 0.8}`
  - **200 code samples** per problem for each non-greedy (random sampling) setting
- We use nucleus sampling with top-p = 0.95 for all HuggingFace-based models (see the sketch after this list)
- CodeGen-6B and CodeGen-16B are accelerated by [FauxPilot](https://github.com/fauxpilot/fauxpilot) (thanks!)
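For reference, the non-greedy settings above roughly correspond to the following HuggingFace `transformers` generation call. This is a minimal sketch rather than our actual generation script; the model name, prompt, and `max_new_tokens` budget are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-2B-mono"  # illustrative; any HuggingFace causal code LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

prompt = 'def fib(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Non-greedy settings: nucleus sampling (top_p = 0.95) at a fixed temperature.
samples = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    max_new_tokens=256,          # illustrative budget
    num_return_sequences=10,     # the release uses 200 samples per problem
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(samples, skip_special_tokens=True)[0])

# Temperature 0 corresponds to greedy decoding: do_sample=False, a single sample.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256,
                        pad_token_id=tokenizer.eos_token_id)
```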
![image](https://user-images.githubusercontent.com/38074777/235818976-b58220ae-3038-4a09-991e-ebdc885c53d8.png)
## Evaluation results
We compute the results from these samples using the test cases of both the base HumanEval and our enhanced HumanEval+:
![image](https://user-images.githubusercontent.com/38074777/235819326-77192614-64ac-40c3-a2c6-5ca325296a77.png)
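If you reuse the released samples to recompute results, pass@k can be estimated with the standard unbiased estimator from Chen et al. (2021). A minimal sketch follows; the example numbers are illustrative only and are not drawn from our tables:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples per problem, c = samples passing all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative only: 200 samples for one problem, 37 of which pass the HumanEval+ tests.
print(pass_at_k(n=200, c=37, k=10))
```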
## Call for contribution
We also encourage open-source developers to contribute to LLM4Code research by: (i) reproducing and validating our results; (ii) uploading LLM-generated samples and reproducing the results of new models; and of course (iii) trying out our enhanced dataset to get more accurate and trustworthy results!