Scandeval

9.1.0

Changed
- Now only stores the top-10 log probabilities of generated tokens when the generation
length is less than 8 tokens. Also now keeps separate caches for each (model,
dataset) combination, where it previously had a single cache for each model. Both of
these help reduce the memory usage of the model output cache.
- Optimised cache saving/loading a bit, making the waiting time in between iterations
slightly shorter.
- Removes the model output cache for a (model, dataset) combination when the
benchmarking of the model on the dataset finishes successfully. Also removes indents
in model output cache JSON files. Both of these help reduce the disk space used for
caching.
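
A minimal sketch of how such a per-(model, dataset) output cache could be organised; the class name, file layout and method signatures here are illustrative assumptions rather than ScandEval's actual implementation:

```python
import json
from pathlib import Path


class ModelOutputCache:
    """Illustrative per-(model, dataset) cache storing only top-k logprobs of short generations."""

    def __init__(self, cache_dir: str, model_id: str, dataset: str,
                 top_k: int = 10, max_generation_len: int = 8) -> None:
        # One JSON file per (model, dataset) combination, rather than one per model.
        self.path = Path(cache_dir) / f"{model_id.replace('/', '--')}--{dataset}.json"
        self.top_k = top_k
        self.max_generation_len = max_generation_len
        self.cache = json.loads(self.path.read_text()) if self.path.exists() else {}

    def add(self, prompt: str, tokens: list[str], logprobs: list[dict[str, float]]) -> None:
        entry: dict = {"tokens": tokens}
        # Only keep logprobs for short generations, and only the top-k candidates per token.
        if len(tokens) < self.max_generation_len:
            entry["logprobs"] = [
                dict(sorted(token_lp.items(), key=lambda kv: kv[1], reverse=True)[: self.top_k])
                for token_lp in logprobs
            ]
        self.cache[prompt] = entry

    def save(self) -> None:
        # No indentation in the JSON, keeping the file small; the file can simply be
        # deleted once benchmarking of the (model, dataset) pair has finished.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.cache))
```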

Fixed
- Only require generative models to output logprobs if the task of the dataset needs
them. Previously, always requiring logprobs caused excessive memory usage when
benchmarking datasets that require long generative outputs, such as NER.

Removed
- Removed some vLLM logging.

9.0.0

Added
- Now caches the completions of open source generative models, which effectively makes
benchmarking of these ~33% faster. We cannot store all logits for storage reasons (it
quickly exceeds 100GB in that case), so we instead store the top-100 logits for each
generated token, but only if the generated sequence is shorter than 50 tokens. We
thus assume that (a) these are the only logits needed, and (b) that the generations
don't change. We argue that (a) holds since we only use the logits in
classification tasks, in which case we only use the first token anyway. Further,
since we're using a temperature of 0, the generations will be as close to
deterministic as possible (up to small rounding fluctuations of the logits, which are
negligible). This is a breaking change, since it is not compatible with the previous
way we cached OpenAI model outputs. A sketch of the top-k logit extraction appears
after this list.
- Added a new `--clear-model-cache` flag, which removes the cached models after
finishing the benchmarking of each model, to save disk space. This doesn't remove the
cached model outputs or datasets.
- Added the following new datasets:
- `fone`, a Faroese NER dataset, which replaces the previous `wikiann-fo` dataset.
- `dansk`, a Danish NER dataset, which replaces the previous `dane` dataset.
- `norquad`, a Norwegian question answering dataset, which replaces the previous
`scandiqa-no` dataset.
- Danish, Swedish, German and Dutch versions of the MMLU, ARC and HellaSwag
datasets, testing knowledge and common sense reasoning of generative models.
These have been machine translated by the University of Oregon using
GPT-3.5-turbo. Machine translation is of course not perfect, so see these as a
first version of such evaluations, intended to get some benchmarks going as soon
as possible.
- `squad-nl`, a Dutch extractive question answering dataset, which is a machine
translated version of SQuAD-v2. As with the datasets mentioned above, this is
meant as a first version of a Dutch QA dataset, until a better one becomes
available.
- Added the `--only-validation-split` flag, which benchmarks the model on the
validation split only; this split is 5-10x smaller than the test split (depending on
the dataset), which is especially useful with paid models like OpenAI models. The
value of this flag is stored in the benchmark results, so it will be visible on
leaderboards.
- Now uses vLLM as the underlying engine for few-shot evaluating generative models,
which drastically improves the evaluation speed, as well as requiring less GPU
memory.
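
As a rough illustration of the logit caching described above, the top-k logprobs of a short generation can be extracted from the per-step scores returned by Hugging Face's `generate`. The function below is a hedged sketch under that assumption, not ScandEval's actual code:

```python
import torch


def top_k_logprobs(scores: tuple[torch.Tensor, ...], k: int = 100,
                   max_generation_len: int = 50) -> list[dict[str, torch.Tensor]] | None:
    """Keep only the top-k logprobs per generated token, and only for short generations.

    `scores` is the tuple of per-step logits returned by
    `model.generate(..., output_scores=True, return_dict_in_generate=True)`.
    """
    if len(scores) >= max_generation_len:
        return None  # Long generations: don't store any logprobs at all.
    kept = []
    for step_logits in scores:  # one tensor of shape (batch_size, vocab_size) per token
        logprobs = step_logits.log_softmax(dim=-1)
        values, indices = logprobs.topk(k=k, dim=-1)
        kept.append({"token_ids": indices, "logprobs": values})
    return kept
```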

Changed
- Now compatible with `transformers >= 4.36.2`, and this version is now required, as
they have changed their generation API in a breaking manner.
- Now removes all newlines from texts in the summarization task, where previously these
were merely "squashed" to single newlines. This makes the separation of few-shot
examples for generative models easier.
- Also removes newlines from the NER task, where these were not removed at all
previously.
- Now doesn't force ASCII characters in the NER task for generative models, making the
target JSON dictionary more consistent with the input text.
- If a model is stored in the Safetensors format on the Hugging Face Hub, then we read
the number of parameters directly from those files. This results in more accurate
parameter counts than loading the model in 4-bit and counting manually (see the
sketch after this list).
- Samples with excessively short or long texts have been removed.
- Adjusted number of few-shot examples in datasets to ensure that the resulting prompt
is at most ~3000 tokens long.
- When timeout errors occur while loading a model, we now retry at most 5 times,
where previously we would attempt to re-load it indefinitely.
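
The Safetensors parameter counting mentioned above can be done without loading any weights, since every `.safetensors` file starts with a JSON header describing each tensor's shape. A minimal sketch for a local file (ScandEval reads the corresponding information for files hosted on the Hugging Face Hub):

```python
import json
import struct
from math import prod


def safetensors_num_params(path: str) -> int:
    """Count parameters from a .safetensors header without loading any weights."""
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian header length, followed by a
        # JSON header mapping tensor names to their dtype, shape and data offsets.
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size))
    return sum(
        prod(info["shape"])
        for name, info in header.items()
        if name != "__metadata__"  # optional metadata entry, not a tensor
    )
```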

Fixed
- Removed `text2text-generation` temporarily from the tags defining generative models,
since we do not support the benchmarking of these yet. This will be added back in as
soon as we support them.
- Now catches `OSError`s when loading Hugging Face model configurations, which happen
when there is no `config.json` file in the model repo.
- When sampling few-shot examples for question answering tasks we previously sampled
among examples with context length less than 1024 characters, to keep the prompt
short. This is too small for some datasets, so now we dynamically set this threshold
based on the dataset itself, starting from 512 and doubling until we have at least
the desired number of few-shot examples to choose from (see the sketch after this
list).
- Now only sets `torch_dtype` if CUDA is available, as errors are caused otherwise.
- Previously text generation in a batch would be stopped if any of the samples in the
batch reached the stopping criteria, causing a lot of incomplete completions. Now
the model continues to generate text until the entire batch is complete, and the
excess generation is removed afterwards.
- When benchmarking encoder models on QA tasks the contexts are split up if they exceed
the model's context length. The stride value used caused errors in rare cases where
the model's maximum context length was really small (128). This has been fixed now.
- Now sets `ignore_mismatched_sizes` when loading models if the model cannot be loaded
otherwise. This previously caused some issues when loading certain models.
- Fixed bug where some encoder models did not work properly when loaded in with FP16
mixed precision due to overflow. We now load in models with BF16 as these have a
larger range, but fall back to FP16 if BF16 is not available. If both lead to
overflow then we attempt again with full FP32, and lastly throw an informative error
and block evaluation if the overflow persists.
- When few-shot evaluating models on NER tasks, we are now more lenient towards the
generated model output. Instead of taking the output as-is, we are now extracting the
first dictionary (enclosed in curly brackets), as well as replacing all single
apostrophes (') with double ones (").
- If a model is already pre-quantized then we will not attempt to quantize it as well.
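
A minimal sketch of the dynamic context-length threshold described in the question answering fix above; the function name and arguments are illustrative assumptions:

```python
def context_length_threshold(contexts: list[str], num_few_shot: int, start: int = 512) -> int:
    """Double the character threshold until enough short contexts are available to sample from."""
    threshold = start
    longest = max((len(context) for context in contexts), default=0)
    while (
        sum(len(context) <= threshold for context in contexts) < num_few_shot
        and threshold < longest
    ):
        threshold *= 2
    return threshold
```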

8.2.1

Fixed
- Removed the non-existent IsReC, FoReC and FoQA datasets.

8.2.0

Added
- Added the following new datasets:
- `sb10k`, a German sentiment classification dataset.
- `dutch-social`, a Dutch sentiment classification dataset.
- `sst5`, an English sentiment classification dataset.
- `germeval`, a German NER dataset.
- `conll-nl`, a Dutch NER dataset.
- `conll-en`, an English NER dataset.
- `scala-de`, a German linguistic acceptability dataset.
- `scala-nl`, a Dutch linguistic acceptability dataset.
- `scala-en`, an English linguistic acceptability dataset.
- `nqii`, an Icelandic extractive question answering dataset.
- `germanquad`, a German extractive question answering dataset.
- `squad`, an English extractive question answering dataset.
- `cnn-dailymail`, an English summarization dataset.

Fixed
- Fixed bug with question answering benchmarking when the answer was a proper subset of
the first token in the context, causing errors when benchmarking some models.
- Some models have been stored in mixed precision as well as containing an
implementation of layer normalisation which is incompatible with such mixed
precision. When loading models we now only load in mixed precision if `torch_dtype`
has been specified in the Hugging Face model configuration (as with the Mistral
model, for instance).
- When sampling examples to use in few-shot prompts for a sequence classification
task, we previously required that the samples be stratified with respect to the
labels. This caused an issue if the dataset did not contain all labels, so now we
only stratify with respect to the labels present in the dataset (see the sketch
after this list).
- When few-shot benchmarking on question answering datasets we previously only used the
samples whose contexts were at most 512 characters long. This turned out to leave too
few samples for `germeval`, so the limit has been raised to 1024 characters.
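
A sketch of the stratified few-shot sampling fix mentioned above, where only the labels actually present in the dataset are used; the function name and data layout are assumptions:

```python
import random
from collections import defaultdict


def stratified_few_shot_sample(examples: list[dict], num_shots: int) -> list[dict]:
    """Sample few-shot examples round-robin over the labels present in the data."""
    by_label: dict[str, list[dict]] = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)
    for pool in by_label.values():
        random.shuffle(pool)

    sampled: list[dict] = []
    labels = list(by_label)
    while len(sampled) < num_shots and any(by_label[label] for label in labels):
        for label in labels:  # only labels that occur in the dataset
            if by_label[label] and len(sampled) < num_shots:
                sampled.append(by_label[label].pop())
    return sampled
```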

8.1.0

Added
- Now added support for text-to-text tasks, which include tasks such as abstractive
summarization, abstractive question-answering and translation. These can only be
benchmarked with generative models. In this release, this includes the following
datasets:
- `nordjylland-news`, a Danish summarization dataset based on news articles.
- `swedn`, a Swedish summarization dataset based on news articles.
- `no-sammendrag`, a Norwegian summarization dataset based on news articles.
- `rrn`, an Icelandic summarization dataset based on news articles.
- `mlsum`, a German summarization dataset based on news articles.
- `wiki-lingua-nl`, a Dutch summarization dataset based on WikiHow articles.
These all belong to the task `summarization`, meaning that they can all be run
using `scandeval --dataset-task summarization --model-id <model_id>`.
- A `--use-flash-attention` flag has been added, which enables Flash Attention 2.0,
which is required by some models, such as Mistral-based ones. If `flash-attn` has not
been installed then an informative error message will be raised. Thanks to
[peter-sk](https://github.com/peter-sk) for this contribution! :tada:

Changed
- Now uses 8-bit AdamW whenever CUDA is available, as opposed to regular AdamW.
Experiments show that this does not affect benchmarking performance, but it reduces
memory usage and thus allows benchmarking of larger models.
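
A hedged sketch of how such an optimiser choice can be expressed via `transformers`; whether ScandEval selects it exactly this way is an assumption:

```python
import torch
from transformers import TrainingArguments

# Use bitsandbytes' 8-bit AdamW when a CUDA GPU is present, otherwise regular AdamW.
optim = "adamw_bnb_8bit" if torch.cuda.is_available() else "adamw_torch"
training_args = TrainingArguments(output_dir="benchmark-output", optim=optim)
```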

Fixed
- A bug was removed which caused some overlap between the dataset splits of the
ScandiQA datasets.
- Now allows loading in models in the data type that they were trained in, which
previously caused errors if they weren't trained in float32.

8.0.0

Added
- Support for few-shot evaluation of decoder models, both models from the Hugging Face
Hub and OpenAI models. This currently happens automatically when specifying a
generative model from the Hugging Face Hub, and with all OpenAI models.
- Now stores model caches in separate directories, enabling parallel evaluations.
Thanks to [KennethEnevoldsen](https://github.com/KennethEnevoldsen) for this
contribution! :tada:
- Added `--device` argument to the CLI, which can be used to overwrite the automatic
detection of device (CPU, CUDA GPU, MPS GPU, TPU) to use.
- Added `--trust-remote-code/--no-trust-remote-code` argument to the CLI, as some
models require this flag to be loaded. It defaults to `False` for security reasons,
however.
- Added `--load-in-4bit/--no-load-in-4bit` argument to the CLI, which can be used to
overwrite the automatic 4-bit loading of models. By default only generative models
will be loaded in 4-bit, and only if a CUDA GPU is available, as this is required by
the underlying `bitsandbytes` package (see the sketch after this list).
- Now manually adjusts the maximum sequence length of a model to ensure that the
reported maximum length is correct.
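
A sketch of the default 4-bit loading behaviour described above, using `transformers` and `bitsandbytes`; the example model ID is hypothetical and ScandEval's actual loading code may differ:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Only quantise to 4-bit when a CUDA GPU is available, since bitsandbytes requires one.
quantization_config = (
    BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    if torch.cuda.is_available()
    else None
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # example generative model
    quantization_config=quantization_config,
    device_map="auto" if torch.cuda.is_available() else None,
)
```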

Changed
- Now only supports Python 3.10 and above.
- Changed the variation in the speed benchmark. Rather than using a fixed-length
document and computing iterations per second, it now uses documents of varying length
and computes tokens per second (see the sketch after this list). This also has the
added benefit of being able to better compare models with varying maximum sequence
lengths. Further, it now uses GPU rather than CPU to accommodate 4-bit models, as
these cannot be run on CPU.
- Changed the `--model-framework` argument to `--framework`.
- Changed the `--use-auth-token` and `--auth-token` arguments to `--use-token` and
`--token`, reflecting the same change in the `transformers` package.
- Now reports all model parameters, rather than just the trainable ones.
- Now uses 8-bit AdamW optimizer when CUDA is available rather than the default AdamW,
to save memory when working with larger models.
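
A rough sketch of a tokens-per-second measurement over documents of varying length; the `pipe` object is assumed to be a Hugging Face pipeline, and the actual documents and timing in ScandEval's speed benchmark differ:

```python
import time


def tokens_per_second(pipe, documents: list[str]) -> float:
    """Measure throughput in tokens per second over a set of documents."""
    total_tokens = 0
    start = time.perf_counter()
    for document in documents:
        total_tokens += len(pipe.tokenizer(document)["input_ids"])
        pipe(document)  # run the model on the document
    return total_tokens / (time.perf_counter() - start)
```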

Removed
- Previously generative models had their maximum sequence length altered by subtracting
their padding token ID. This is not needed anymore and has been removed.

Fixed
- Handles timeouts better now when fetching models from the Hugging Face Hub. Instead
of throwing the error and cancelling the benchmarking process, it now retries until
the connection is restored.
- Some models output both logits and hidden states, which caused unnecessary
out-of-memory issues. This is now handled using the `preprocess_logits_for_metrics`
argument in `Trainer` (see the sketch after this list).
- Now catches errors while loading model configurations.
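
A sketch of how `preprocess_logits_for_metrics` can be used to avoid keeping hidden states around; the exact function ScandEval passes is an assumption:

```python
def preprocess_logits_for_metrics(logits, labels):
    """Keep only the logits before the Trainer stores predictions for metric computation.

    Passed as `Trainer(..., preprocess_logits_for_metrics=preprocess_logits_for_metrics)`.
    Some models return a tuple such as (logits, hidden_states); discarding everything
    but the logits avoids accumulating large hidden-state tensors during evaluation.
    """
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits
```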
