Scandeval

Latest version: v13.0.0

12.10.4

Fixed
- Access to the evaluation datasets was shut down by Hugging Face again. It has now
been restored.

12.10.3

Fixed
- Access to the evaluation datasets was shut down by Hugging Face. It has now been
restored.

12.10.2

Fixed
- Correctly update the logits processors and prefix-allowed-tokens functions for NER
datasets when starting generation.
- We now use logprobs for OpenAI models, as the chat models now support this. This is
used for all sequence classification based tasks, which currently comprise sentiment
classification, linguistic acceptability, knowledge and common-sense reasoning. This
fixes some incorrect evaluations of the newer GPT-4-turbo and GPT-4o models, as they
tend to output things like "Sentiment: positive" rather than simply "positive" (see
the sketch after this list).

12.10.1

Fixed
- Now recognises the metadata for the new GPT-4o models correctly. Currently there is a
version clash between `vllm` and `tiktoken`, meaning that one needs to manually
upgrade `tiktoken` to evaluate GPT-4o; an informative error message now points this
out to the user in that case (see the sketches after this list).
- The number of generated tokens for sequence classification tasks has been changed
back to 1 (from 3). This makes no difference for open source models, as we only use
the logprobs from the first token anyway, but it makes a big difference on multiple
choice QA tasks for OpenAI models, as some of them might output things like "a is
correct" rather than simply "a". Since we use the word edit distance to the labels,
this could accidentally cause the final prediction to differ from "a" (see the
sketches after this list).
- An error in `outlines<=0.0.36` meant that NER evaluations were near-random.
Unfortunately, due to a strict `outlines` requirement in `vllm`, we cannot enforce
`outlines>=0.0.37` (see [this vLLM PR for a future
fix](https://github.com/vllm-project/vllm/pull/4109)). For now, to prevent faulty
evaluations, we raise an error asking the user to manually upgrade `outlines` if they
have an old version (see the sketches after this list).
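
The `tiktoken` check mentioned above can be sketched as follows. This is a
simplified illustration rather than ScandEval's actual code; the helper name and
the error message are assumptions.

```python
# Simplified sketch: verify that the installed tiktoken knows the GPT-4o
# encoding before benchmarking, and fail with an actionable message otherwise.
import tiktoken


def check_tiktoken_supports_gpt4o() -> None:
    try:
        tiktoken.encoding_for_model("gpt-4o")
    except KeyError as exc:
        raise RuntimeError(
            "Your `tiktoken` version does not recognise GPT-4o. Please upgrade it "
            "manually, e.g. `pip install -U tiktoken`."
        ) from exc


check_tiktoken_supports_gpt4o()
```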
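
The word-edit-distance mapping of generations onto labels can be sketched like
this; the label set and helper names are assumptions made for the example.

```python
# Minimal sketch: map a free-form generation onto the closest candidate label
# using word-level edit distance.
def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance computed over words rather than characters."""
    dp = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, word_b in enumerate(b, start=1):
            cur = min(
                dp[j] + 1,                  # delete word_a
                dp[j - 1] + 1,              # insert word_b
                prev + (word_a != word_b),  # substitute word_a with word_b
            )
            prev, dp[j] = dp[j], cur
    return dp[-1]


def closest_label(generation: str, labels: list[str]) -> str:
    words = generation.lower().split()
    return min(labels, key=lambda label: word_edit_distance(words, label.split()))


labels = ["a", "b", "c", "d"]
print(closest_label("a", labels))             # -> "a", distance 0
print(closest_label("a is correct", labels))  # still "a" here, but longer free-form
                                              # outputs can end up closer to a
                                              # different label
```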
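
The guard against broken `outlines` versions might look roughly like the
following sketch; the exact error message is an assumption.

```python
# Sketch: refuse to run NER evaluations when a known-broken outlines version is
# installed, since the constraint cannot yet be expressed in the dependencies.
from importlib.metadata import version

from packaging.version import Version


def check_outlines_version() -> None:
    installed = Version(version("outlines"))
    if installed <= Version("0.0.36"):
        raise RuntimeError(
            f"You have outlines=={installed} installed, which makes NER evaluations "
            "near-random. Please upgrade it manually, e.g. "
            "`pip install 'outlines>=0.0.37'`."
        )


check_outlines_version()
```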

12.10.0

Changed
- Update `autoawq` to `>=0.2.5,<0.3.0`, as it no longer has a dependency clash with
`transformers`.
- Update `vllm` to `>=0.4.2,<0.5.0`, to support new models (such as Phi-3).
- Update `torch` to `>=2.3.0,<3.0.0`, as this is required by `vllm`.

Fixed
- When overriding benchmark configuration parameters in `Benchmarker.benchmark`, the
overridden parameters are now correctly used when building datasets.
- When a generative model was benchmarked on a NER task followed by another task, the
structured generation wasn't set up correctly, as we have not been re-initialising
the model between datasets since v12.8.0. We now ensure that the logits processors
are re-built for every dataset (see the sketch after this list).
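
What re-building the generation constraints per dataset amounts to can be
sketched as follows. The helper, model name and label sets are assumptions made
for illustration, not ScandEval's actual code.

```python
# Sketch: re-build the generation constraint for each dataset instead of
# reusing the one created for the previous dataset.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)


def build_prefix_allowed_tokens_fn(labels: list[str]):
    """Only allow tokens that start one of the dataset's labels."""
    allowed_tokens = sorted(
        {tokenizer(label, add_special_tokens=False).input_ids[0] for label in labels}
    )

    def prefix_allowed_tokens_fn(batch_id: int, input_ids) -> list[int]:
        return allowed_tokens

    return prefix_allowed_tokens_fn


for dataset_labels in (["positive", "negative"], ["correct", "incorrect"]):
    # Crucially, the constraint is re-built here for every dataset.
    fn = build_prefix_allowed_tokens_fn(dataset_labels)
    inputs = tokenizer("The answer is", return_tensors="pt")
    model.generate(
        **inputs,
        max_new_tokens=1,
        prefix_allowed_tokens_fn=fn,
        pad_token_id=tokenizer.eos_token_id,
    )
```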

12.9.1

Fixed
- Disables vLLM's prefix caching, as it has not yet been implemented with sliding
window attention, causing re-initialisation errors (see the sketch after this list).
- Updates `vllm` to `>=0.4.1,<0.5.0`, as this fixes an issue with benchmarking
freezing.
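
Disabling prefix caching corresponds roughly to the following sketch, assuming
vLLM's `enable_prefix_caching` engine argument; the model name is a placeholder.

```python
# Sketch: instantiate the vLLM engine with automatic prefix caching disabled,
# since it clashes with sliding window attention.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # a model that uses sliding window attention
    enable_prefix_caching=False,
)

outputs = llm.generate(
    ["The capital of Denmark is"],
    SamplingParams(max_tokens=5, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```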
