Fixed
- Changed the vLLM inference parameters to reduce GPU memory overhead during
  evaluation, making it possible to evaluate larger models on the same hardware as
  before. Concretely, `gpu_memory_utilization` has been raised from 0.9 to 0.95,
  `enforce_eager` is set to `True`, and `max_model_len` has been reduced from (at
  most) 10,000 to (at most) 5,000. See [this
  issue](https://github.com/ScandEval/ScandEval/issues/383) for an overview of the
  maximum number of tokens in each dataset (as of v12.6.0 of ScandEval).
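As a rough sketch (not ScandEval's actual code), the new settings correspond to keyword arguments of the `vllm.LLM` constructor; the model ID below is a placeholder:

```python
# The vLLM parameters described above, collected as keyword arguments.
vllm_kwargs = dict(
    gpu_memory_utilization=0.95,  # raised from 0.9: let vLLM use more of the GPU
    enforce_eager=True,           # skip CUDA graph capture, saving GPU memory
    max_model_len=5_000,          # reduced from 10,000: a much smaller KV cache
)

# Requires a GPU and the `vllm` package:
# from vllm import LLM
# llm = LLM(model="some-org/some-model", **vllm_kwargs)
```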
- Replaced an abnormally long sample in the Swedish sentiment classification dataset
  SweReC with a new one, keeping the maximum number of tokens in the samples below
  5,000.
- The number of allowed generated tokens for the Danish summarisation dataset
  Nordjylland News was mistakenly set to 128, rather than the 256 used for all other
  summarisation datasets. This has now been fixed.
- Now correctly detects whether `autoawq` needs to be installed when evaluating an
  AWQ model.
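A minimal sketch of such a check, assuming the model's Hugging Face config is available as a dict (as returned by `AutoConfig.to_dict()`); the helper name is an assumption for illustration:

```python
def requires_autoawq(config: dict) -> bool:
    """Hypothetical helper: True if the model was quantised with AWQ,
    in which case the `autoawq` package is needed for evaluation."""
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method") == "awq"


# Example: an AWQ-quantised model advertises its method in the config.
requires_autoawq({"quantization_config": {"quant_method": "awq"}})  # True
requires_autoawq({})  # False: no quantisation config at all
```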
- Reduced `transformers` dependency to `4.38.x` again, as `autoawq` requires this.
- No longer applies BitsAndBytes quantisation if the model is already quantised.
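The guard can be sketched as follows, again assuming a dict-shaped Hugging Face config; the function name and `load_in_4bit` flag are assumptions, not ScandEval's actual API:

```python
def should_apply_bnb(config: dict, load_in_4bit: bool) -> bool:
    """Hypothetical helper: apply BitsAndBytes quantisation only when it was
    requested and the model is not already quantised (e.g. AWQ or GPTQ)."""
    already_quantised = config.get("quantization_config") is not None
    return load_in_4bit and not already_quantised


# Example: a pre-quantised AWQ model must not be re-quantised with BitsAndBytes.
should_apply_bnb({"quantization_config": {"quant_method": "awq"}}, load_in_4bit=True)  # False
should_apply_bnb({}, load_in_4bit=True)  # True
```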