In deepeval's latest release, v0.21.15, we're releasing:
- Synthetic data generation. Easily generate synthetic evaluation data from your documents (a short sketch follows below this list): https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Caching. If you're running 10k test cases and the run fails at the 9,999th one, you no longer have to rerun the first 9,999 test cases, because you can read them back from the cache using the `-c` flag (example below): https://docs.confident-ai.com/docs/evaluation-introduction#cache
- Repeats. If you want to repeat each test case for statistical significance, use the `-r` flag (example below): https://docs.confident-ai.com/docs/evaluation-introduction#repeats
- LLM Benchmarks. We now support popular benchmarks such as MMLU, HellaSwag, and BIG-Bench Hard, so anyone can evaluate ANY model on research-backed benchmarks in a few lines of code (sketch below).
- G-Eval improvements. The G-Eval metric now supports using the log probabilities of output tokens to compute a probability-weighted, summed score, as described in the original G-Eval paper (usage example below).
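
Here's a minimal sketch of the synthetic data workflow. The class and method names (`Synthesizer`, `generate_goldens_from_docs`) and the document paths are assumptions for illustration; check the synthetic data docs linked above for the exact interface in your version.

```python
# Sketch only: API names below are assumptions -- see the synthetic data
# docs linked above for the exact interface.
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# Generate synthetic goldens (inputs + expected outputs) straight from documents
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base.pdf", "faq.txt"],  # hypothetical paths
)
print(f"Generated {len(goldens)} synthetic goldens")
```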
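The caching and repeat flags are used with `deepeval test run`. The test file name below is a placeholder, and we're assuming `-r` takes a repeat count; the flags themselves are documented at the links above.

```bash
# Resume from cache after a failed run instead of re-running completed test cases
deepeval test run test_chatbot.py -c

# Repeat each test case (here, 3 times) for statistical significance
deepeval test run test_chatbot.py -r 3
```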
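And here's a rough sketch of running a benchmark against your own model. The module paths and the minimal `DeepEvalBaseLLM` stub follow deepeval's docs but may differ slightly by version; swap the stub's `generate` for real calls to your model.

```python
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask
from deepeval.models.base_model import DeepEvalBaseLLM

class MyModel(DeepEvalBaseLLM):
    """Minimal placeholder wrapper; replace the bodies with your model's calls."""
    def load_model(self):
        return None
    def generate(self, prompt: str) -> str:
        return "A"  # placeholder answer; call your model here
    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)
    def get_model_name(self) -> str:
        return "my-model"

benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE],  # run a subset of MMLU tasks
    n_shots=3,                                      # few-shot prompting
)
benchmark.evaluate(model=MyModel())
print(benchmark.overall_score)
```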
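Finally, a short G-Eval usage example; the criteria string and test case contents are just illustrative. The logprob weighting happens internally when the evaluation model exposes token log probabilities.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define a custom, criteria-based metric; G-Eval prompts an LLM judge
# and (new in this release) weights the score by output-token logprobs.
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct "
             "based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Who ran up the tree?",
    actual_output="The dog chased the cat, so probably the cat.",
    expected_output="The cat ran up the tree.",
)

correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)
```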