## What is this?
In addition to the initial version of the EvalPlus source code, we release the LLM-generated code samples on HumanEval+ (also applicable to the base HumanEval) together with regularized ground-truth solutions. With these we hope to accelerate future research: researchers can reuse our pre-generated samples instead of generating them from scratch.
- `${MODEL_NAME}_temp_${TEMPERATURE}.zip`: LLM-produced program samples
- `HumanEvalPlusGT.zip`: The re-implemented ground-truth solutions
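For convenience, here is a minimal sketch of how the released archives can be inspected with Python's standard library. The concrete file name and the internal layout printed here are assumptions for illustration, not a specification of the release format:

```python
import zipfile

# Hypothetical file names: substitute the archives downloaded from this release.
archives = [
    "codegen-16B_temp_0.8.zip",  # ${MODEL_NAME}_temp_${TEMPERATURE}.zip
    "HumanEvalPlusGT.zip",       # re-implemented ground-truth solutions
]

# Peek at the archive layout without extracting everything.
for path in archives:
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        print(f"{path}: {len(names)} entries, e.g. {names[:3]}")
```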
## Data sources
The configuration used to generate these code samples follows our preprint: https://arxiv.org/abs/2305.01210
- The code samples cover:
  - **14 models** (10 model types)
  - **5 temperature settings**: 0 (greedy decoding) as well as `{0.2, 0.4, 0.6, 0.8}`
  - **200 code samples** per problem for each non-greedy (random sampling) setting
- We use nucleus sampling with top-p = 0.95 for all HuggingFace-based models (see the sketch after this list)
- CodeGen-6B and CodeGen-16B are accelerated by [FauxPilot](https://github.com/fauxpilot/fauxpilot) (thanks!)
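For reference, the non-greedy settings above roughly correspond to the following HuggingFace `transformers` generation call. This is a minimal sketch rather than our actual generation script; the model name, prompt, and `max_new_tokens` budget are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-2B-mono"  # illustrative; any HuggingFace causal code LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

prompt = 'def fib(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Non-greedy settings: nucleus sampling (top_p = 0.95) at a fixed temperature.
samples = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    max_new_tokens=256,          # illustrative budget
    num_return_sequences=10,     # the release uses 200 samples per problem
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(samples, skip_special_tokens=True)[0])

# Temperature 0 corresponds to greedy decoding: do_sample=False, a single sample.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256,
                        pad_token_id=tokenizer.eos_token_id)
```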
![image](https://user-images.githubusercontent.com/38074777/235818976-b58220ae-3038-4a09-991e-ebdc885c53d8.png)
## Evaluation results
We compute the results from these samples using the test cases of both the base HumanEval and our enhanced HumanEval+:
![image](https://user-images.githubusercontent.com/38074777/235819326-77192614-64ac-40c3-a2c6-5ca325296a77.png)
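If you reuse the released samples to recompute results, pass@k can be estimated with the standard unbiased estimator from Chen et al. (2021). A minimal sketch follows; the example numbers are illustrative only and are not drawn from our tables:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples per problem, c = samples passing all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative only: 200 samples for one problem, 37 of which pass the HumanEval+ tests.
print(pass_at_k(n=200, c=37, k=10))
```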
## Call for contribution
We also encourage open-source developers to contribute to LLM4Code research by: (i) reproducing and validating our results; (ii) uploading LLM-generated samples and reproducing the results of new models; and of course (iii) trying out our enhanced dataset to get more accurate and trustworthy results!