ScandEval

Latest version: v12.10.8

12.5.0

Added
- We now support evaluation of quantised models, such as GPTQ and AWQ, when the vLLM
backend is being used (the default).
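
  A minimal usage sketch, assuming the callable `Benchmarker` interface from this
  package; the GPTQ model ID is illustrative, not a tested example:

  ```python
  # Sketch only: assumes the callable `Benchmarker` interface; the quantised
  # (GPTQ) model ID is an illustrative Hugging Face Hub model.
  from scandeval import Benchmarker

  benchmark = Benchmarker()

  # vLLM is the default backend, so a quantised model can be passed directly.
  benchmark("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ")
  ```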

Fixed
- Move tensor to the correct device when benchmarking seq-to-seq models (#363). Thanks
  to [ThomasKluiters](https://github.com/ThomasKluiters) for this contribution! :tada:
- Handle the case where an instruction-tuned model does not use any special token
  at the end of the chat, such as `<|im_end|>`. This holds for, e.g., Qwen models.
- Better auto-detection of pipeline tag for models on the Hugging Face Hub, in case the
tag is not manually set.

12.4.0

Added
- Support for Azure OpenAI models! These can now be benchmarked like any other
  model: either set the environment variables `AZURE_OPENAI_API_KEY`,
  `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_VERSION`, or supply the values
  through the `--azure-openai-api-key`, `--azure-openai-endpoint` and
  `--azure-openai-api-version` arguments (see the first sketch after this list).
  Thanks to [BramVanroy](https://github.com/BramVanroy) for all the help regarding
  the implementation of this :tada:
- We now use the new JSON mode for newer OpenAI models on the NER task, to ensure
  better JSON generation (see the second sketch after this list).
- If an error is thrown during generation with an OpenAI model (for instance, when
  the prompt is caught by the content filter), we simply return a blank string
  instead.
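
  A minimal sketch of the environment-variable route, assuming the callable
  `Benchmarker` interface from this package; the key, endpoint, API version and
  deployment name are placeholders:

  ```python
  # Sketch only: assumes the callable `Benchmarker` interface; fill in the
  # placeholder values for your own Azure OpenAI resource.
  import os

  from scandeval import Benchmarker

  os.environ["AZURE_OPENAI_API_KEY"] = "<your-api-key>"
  os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<your-resource>.openai.azure.com"
  os.environ["AZURE_OPENAI_API_VERSION"] = "<api-version>"

  benchmark = Benchmarker()
  benchmark("<your-deployment-name>")  # benchmark the Azure OpenAI deployment
  ```

  The `--azure-openai-api-key`, `--azure-openai-endpoint` and
  `--azure-openai-api-version` arguments are the command-line equivalents of these
  environment variables.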
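
  JSON mode itself is a feature of the OpenAI API; here is a sketch of what it
  does, using the OpenAI Python SDK directly (the model name and prompts are
  illustrative, not ScandEval's internal code):

  ```python
  # Illustrative use of OpenAI's JSON mode; requires a model that supports
  # response_format, and the word "JSON" must appear in the messages.
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  response = client.chat.completions.create(
      model="gpt-4-turbo-preview",
      response_format={"type": "json_object"},  # forces syntactically valid JSON
      messages=[
          {"role": "system", "content": "Return the named entities as a JSON object."},
          {"role": "user", "content": "Anna works at Novo Nordisk in Copenhagen."},
      ],
  )
  print(response.choices[0].message.content)
  ```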

Changed
- Updated `outlines` dependency to v0.0.37, which can now correctly deal with a larger
batch size when integrated with vLLM. This results in faster NER evaluation.

Fixed
- Move models to the device before running any inference with them, as not doing so
  causes issues when flash attention is enabled.
- When benchmarking instruction-tuned models, we now ensure that generation stops
  when the end-of-chat token (such as `<|im_end|>` or `[/INST]`) is reached.
  Previously, generation could continue past this token, which had a negative
  performance impact on question answering and summarization; the remaining tasks
  were not affected.
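
  To illustrate the mechanism, here is a sketch using plain `transformers` stopping
  criteria; the model ID and prompt are illustrative, and this is not ScandEval's
  actual implementation:

  ```python
  # Sketch only: stop generation once the sequence ends with an end-of-chat
  # token, using Hugging Face transformers stopping criteria.
  import torch
  from transformers import (
      AutoModelForCausalLM,
      AutoTokenizer,
      StoppingCriteria,
      StoppingCriteriaList,
  )

  class StopOnTokens(StoppingCriteria):
      """Stop generation once the sequence ends with any of the stop token IDs."""

      def __init__(self, stop_token_ids: list[list[int]]) -> None:
          self.stop_token_ids = stop_token_ids

      def __call__(self, input_ids: torch.LongTensor, scores, **kwargs) -> bool:
          return any(
              input_ids[0, -len(stop_ids):].tolist() == stop_ids
              for stop_ids in self.stop_token_ids
          )

  model_id = "Qwen/Qwen1.5-0.5B-Chat"  # illustrative instruction-tuned model
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id)

  # Token IDs of the end-of-chat marker used by this model family.
  stop_ids = [tokenizer.encode("<|im_end|>", add_special_tokens=False)]

  prompt = tokenizer.apply_chat_template(
      [{"role": "user", "content": "Name the capital of Denmark."}],
      tokenize=False,
      add_generation_prompt=True,
  )
  inputs = tokenizer(prompt, return_tensors="pt")
  outputs = model.generate(
      **inputs,
      max_new_tokens=32,
      stopping_criteria=StoppingCriteriaList([StopOnTokens(stop_ids)]),
  )
  print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
  ```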

12.3.2

Fixed
- There is an issue with the underlying `outlines` package that we use for structured
generation, where many of the generations stop prematurely when the batch is too
large. We fix this temporarily by lowering the batch size from the entire dataset to
the standard 32 when vLLM is used for NER tasks. This will be changed back when the
bug is fixed. Follow the progress in [this `outlines`
issue](https://github.com/outlines-dev/outlines/issues/757).
- Issue when checking whether the `openai` extra needs to be installed, or whether
  the `OPENAI_API_KEY` environment variable needs to be set.
- Setting `add_prefix_space=False` caused an error during the loading of some
  tokenizers. To fix this, we now only supply the `add_prefix_space` keyword
  argument when loading the tokenizer if it is `True`.
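
  A sketch of the conditional-keyword approach described above (the function name
  is illustrative, not ScandEval's actual loading code):

  ```python
  # Only pass `add_prefix_space` to the tokenizer when it is True, since some
  # tokenizers raise an error when the argument is supplied as False.
  from transformers import AutoTokenizer

  def load_tokenizer(model_id: str, add_prefix_space: bool):
      kwargs = {"add_prefix_space": True} if add_prefix_space else {}
      return AutoTokenizer.from_pretrained(model_id, **kwargs)
  ```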

12.3.1

Fixed
- An issue with Pydantic typing, causing initialisation of `Benchmarker` to throw an
error.

12.3.0

Changed
- Updated `outlines` dependency to `>=0.0.36,<0.1`. This fixes a race condition caused
during evaluation of NER datasets and also includes integration with the
`transformers` library. The existing hardcoded integration has now been removed in
favour of the integration in that package.

12.2.1

Fixed
- Now includes the `transformers` integration with `outlines` directly in the code,
  as it was not part of the newest `outlines` release, which caused issues. Once the
  integration is included in a release, we will import it as before.
- When evaluating OpenAI models we no longer perform any structured generation, as
  we do not have access to the logits.
