Breaking Changes
- Raised the minimum supported Python version to 3.9.
- All prompts for our built-in LLM-based metrics have been updated. Each prompt now ends with `Output your thought process first, and then provide your final answer` so that LLM evaluators actually perform chain-of-thought reasoning. This may change the output scores.
- Fixed a typo in a module name: `langcheck.utils.progess_bar` is now `langcheck.utils.progress_bar`.
- The default prompts for `langcheck.metrics.en.toxicity` and `langcheck.metrics.ja.toxicity` have been updated. Refer to #136 for a comparison with the original prompt. You can fall back to the old prompts by passing `eval_prompt_version="v1"` as an argument (see the sketch after this list).
- Updated the arguments of `langcheck.augment.rephrase`: it now takes an `EvalClient` instead of raw OpenAI parameters (see the sketch after this list).
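
A minimal sketch of the two changes above. The `eval_prompt_version` argument is quoted from this release; the `OpenAIEvalClient` setup and the `eval_model` / `eval_client` keyword names are assumptions, so check the metric and augmentation docs for the exact signatures:

```python
import langcheck.augment
import langcheck.metrics
from langcheck.metrics.eval_clients import OpenAIEvalClient

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAIEvalClient()

# Fall back to the pre-v0.8 toxicity prompt via eval_prompt_version.
toxicity = langcheck.metrics.en.toxicity(
    ["That was a completely useless answer."],
    eval_model=client,          # assumption: the EvalClient keyword
    eval_prompt_version="v1",   # omit this to use the new default prompt
)

# rephrase() now takes an EvalClient instead of raw OpenAI parameters.
rephrased = langcheck.augment.rephrase(
    ["How do I reset my password?"],
    eval_client=client,         # assumption: the EvalClient keyword
)
```
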
New Features
- Added [langcheck.metrics.custom_text_quality](https://langcheck.readthedocs.io/en/latest/langcheck.metrics.custom_text_quality.html#module-langcheck.metrics.custom_text_quality). With the functions in this module, you can build your own LLM-based metrics with custom prompts (a sketch follows this list); see the documentation for details.
- Added support for local LLMs as evaluators:
- [LlamaEvalClient](https://langcheck.readthedocs.io/en/latest/langcheck.metrics.eval_clients.html#langcheck.metrics.eval_clients.LlamaEvalClient)
- [PrometheusEvalClient](https://langcheck.readthedocs.io/en/latest/langcheck.metrics.eval_clients.html#langcheck.metrics.eval_clients.PrometheusEvalClient)
- Added new text augmentations (see the sketch after this list):
- `jailbreak_templates` augmentation with the following templates
- `basic`, `chatgpt_dan`, `chatgpt_good_vs_evil`, `john` and `universal_adversarial_suffix` (EN)
- `basic`, `chatgpt_good_vs_evil` and `john` (JA)
- `payload_splitting` (EN, JA)
- `to_full_width` (EN)
- `conv_kana` (JA)
- Added new built-in LLM-based metrics for both EN & JA (see the sketch after this list):
- `answer_correctness`
- `answer_safety`
- `personal_data_leakage`
- `hate_speech`
- `adult_content`
- `harmful_activity`
- Added "Simulated Annotators", a confidence score estimating method proposed in paper [Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement](https://arxiv.org/abs/2407.18370). You can use that by adding `calculated_confidence=True` for `langcheck.metrics.en.pairwise_comparison`.
- Added support for embedding-based metrics (e.g. `semantic_similarity`) with the async OpenAI-based eval clients.
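
A sketch of a custom metric built with `langcheck.metrics.custom_text_quality`. The function name `custom_evaluator`, the `metric_name` / `prompt_template` / `score_map` arguments, and the template placeholder syntax are assumptions for illustration; the linked documentation has the actual API:

```python
from langcheck.metrics.eval_clients import OpenAIEvalClient
# Hypothetical import: see the custom_text_quality docs for the real function names.
from langcheck.metrics.custom_text_quality import custom_evaluator

client = OpenAIEvalClient()

politeness_prompt = """You are evaluating the politeness of a response.
Response: {gen_output}
Classify the response as "Polite", "Neutral", or "Rude".
Output your thought process first, and then provide your final answer."""

politeness = custom_evaluator(
    generated_outputs=["Thanks for reaching out! Happy to help."],
    prompts=["Where is my order?"],
    eval_model=client,
    metric_name="politeness",                                # hypothetical argument
    prompt_template=politeness_prompt,                       # hypothetical argument
    score_map={"Polite": 1.0, "Neutral": 0.5, "Rude": 0.0},  # hypothetical argument
)
```
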
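A sketch of the new augmentations, assuming they are exposed as functions under `langcheck.augment` (EN) and `langcheck.augment.ja` (JA) like the existing augmentations, and that `jailbreak_templates` selects templates via a keyword argument; argument lists are abbreviated:

```python
import langcheck.augment
import langcheck.augment.ja

prompts = ["Explain how to pick a lock."]

# Wrap each prompt in well-known jailbreak templates (keyword name is an assumption).
jailbroken = langcheck.augment.jailbreak_templates(
    prompts, templates=["basic", "chatgpt_dan"]
)

# Hide the payload by splitting it into parts that are reassembled inside the prompt.
split = langcheck.augment.payload_splitting(prompts)

# Replace ASCII characters with their full-width equivalents (EN only).
full_width = langcheck.augment.to_full_width(prompts)

# Convert between kana forms for Japanese text (JA only).
kana = langcheck.augment.ja.conv_kana(["パスワードを教えてください。"])
```
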
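The new metrics appear to follow the calling convention of the existing LLM-based metrics, so usage presumably looks like the sketch below; the `eval_model` keyword and the `reference_outputs` argument for `answer_correctness` are assumptions:

```python
import langcheck.metrics
from langcheck.metrics.eval_clients import OpenAIEvalClient

client = OpenAIEvalClient()

# Flag unsafe or harmful responses.
safety = langcheck.metrics.en.answer_safety(
    ["Sure, here is a simple pancake recipe..."],
    prompts=["Give me a pancake recipe."],
    eval_model=client,             # assumption: the EvalClient keyword
)

# Check the answer against a reference answer.
correctness = langcheck.metrics.en.answer_correctness(
    ["Paris"],
    prompts=["What is the capital of France?"],
    reference_outputs=["Paris"],   # assumption: correctness compares against references
    eval_model=client,
)
```
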
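Simulated Annotators is enabled through the `calculated_confidence=True` flag quoted above; the remaining argument names in this sketch are assumptions based on the existing pairwise comparison API:

```python
import langcheck.metrics
from langcheck.metrics.eval_clients import OpenAIEvalClient

client = OpenAIEvalClient()

result = langcheck.metrics.en.pairwise_comparison(
    generated_outputs_a=["The Eiffel Tower is in Paris."],
    generated_outputs_b=["The Eiffel Tower is in Berlin."],
    prompts=["Where is the Eiffel Tower?"],
    eval_model=client,            # assumption: the EvalClient keyword
    calculated_confidence=True,   # adds the Simulated Annotators confidence estimate
)
```
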
Bug Fixes
- Added error handling to `OpenAIEvalClient` and `GeminiEvalClient` so that they return `None` instead of raising an error when the function calling step fails.
- Updated `langcheck.metrics.pairwise_comparison` to accept lists with `None` as source texts.
- Fixed an error in `langcheck.augment.synonym` caused by a missing `nltk` package.
- Fixed an issue with decoding UTF-8 text in some environments.
- Fixed typos in documentation.
**Full Changelog**: https://github.com/citadel-ai/langcheck/compare/v0.7.1...v0.8.0