### Added
- Support for few-shot evaluation of decoder models, both from the Hugging Face Hub
and from OpenAI. This currently happens automatically when a generative model from the
Hugging Face Hub is specified, and for all OpenAI models (see the prompt sketch after
this list).
- Now stores model caches in separate directories, enabling parallel evaluations.
Thanks to [KennethEnevoldsen](https://github.com/KennethEnevoldsen) for this
contribution! :tada:
- Added `--device` argument to the CLI, which can be used to override the automatic
detection of the device (CPU, CUDA GPU, MPS GPU or TPU) to use (see the detection
sketch after this list).
- Added `--trust-remote-code/--no-trust-remote-code` argument to the CLI, as some
models require this flag in order to be loaded. For security reasons, it defaults to
`False`.
- Added `--load-in-4bit/--no-load-in-4bit` argument to the CLI, which can be used to
override the automatic 4-bit loading of models. By default, only generative models are
loaded in 4-bit, and only if a CUDA GPU is available, as this is required by the
underlying `bitsandbytes` package (see the loading sketch after this list).
- Now manually adjusts the maximum sequence length of a model to ensure that the
reported maximum length is correct (see the sketch after this list).
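
As an illustration of the few-shot evaluation mentioned above, here is a minimal
sketch of how a few-shot prompt can be built from labelled examples. The helper name
and the prompt template are assumptions, not the benchmark's actual format:

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], new_text: str) -> str:
    """Build a few-shot prompt from labelled examples plus a new input.

    Hypothetical helper; the template below is illustrative only.
    """
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in examples]
    blocks.append(f"Text: {new_text}\nLabel:")
    return "\n\n".join(blocks)
```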
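
The `--device` flag overrides automatic detection along the lines of the following
sketch; TPU detection is omitted, and the exact priority order is an assumption:

```python
import torch

def detect_device() -> torch.device:
    # Prefer a CUDA GPU, then an Apple MPS GPU, then fall back to the CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```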
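
The default 4-bit loading of generative models corresponds roughly to this
`transformers` sketch; the model ID and the compute dtype are illustrative
placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the model weights to 4-bit on load; this requires a CUDA GPU,
# as 4-bit quantization is backed by the `bitsandbytes` package.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-generative-model",  # illustrative model ID
    quantization_config=quantization_config,
    device_map="auto",
)
```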
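
The manual adjustment of the maximum sequence length could look roughly like this;
the sentinel check and the config fallback are assumptions about the implementation:

```python
def get_max_sequence_length(model, tokenizer) -> int:
    # Tokenizers sometimes report a huge sentinel value when the true
    # maximum length is unknown, so fall back to the model config.
    max_length = tokenizer.model_max_length
    if max_length > 100_000:
        max_length = getattr(model.config, "max_position_embeddings", max_length)
    return max_length
```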
### Changed
- Now only supports Python 3.10 and above.
- Changed the variation in the speed benchmark. Rather than using a fixed-length
document and computing iterations per second, it now uses documents of varying length
and computes tokens per second (see the sketch after this list). This also has the
added benefit of making it easier to compare models with different maximum sequence
lengths. Further, it now uses the GPU rather than the CPU, to accommodate 4-bit
models, as these cannot be run on CPU.
- Changed the `--model-framework` argument to `--framework`.
- Changed the `--use-auth-token` and `--auth-token` arguments to `--use-token` and
`--token`, reflecting the same change in the `transformers` package.
- Now reports all model parameters, rather than just the trainable ones (see the
parameter-counting sketch after this list).
- Now uses the 8-bit AdamW optimizer when CUDA is available, rather than the default
AdamW, to save memory when working with larger models (see the optimizer sketch after
this list).
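
A minimal sketch of the new tokens-per-second speed metric, assuming a `transformers`
model and tokenizer; the actual benchmark's batching and timing details may differ:

```python
import time

import torch

def tokens_per_second(model, tokenizer, documents: list[str]) -> float:
    # Time forward passes over documents of varying length and report
    # the total number of processed tokens per second.
    total_tokens = 0
    start = time.perf_counter()
    with torch.inference_mode():
        for doc in documents:
            inputs = tokenizer(doc, return_tensors="pt", truncation=True)
            model(**inputs)
            total_tokens += inputs["input_ids"].shape[1]
    return total_tokens / (time.perf_counter() - start)
```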
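
The parameter-counting change amounts to dropping the `requires_grad` filter, as in
this sketch:

```python
def count_parameters(model) -> tuple[int, int]:
    # Total parameter count (now reported) versus the trainable-only
    # count (the previous behavior).
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable
```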
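
The optimizer selection can be sketched as follows, using the 8-bit AdamW from
`bitsandbytes` when CUDA is available; the learning rate is an illustrative value:

```python
import torch

def make_optimizer(model, lr: float = 2e-5):
    # The 8-bit AdamW stores optimizer state in 8-bit precision, saving
    # memory for larger models; it requires a CUDA GPU.
    if torch.cuda.is_available():
        import bitsandbytes as bnb
        return bnb.optim.AdamW8bit(model.parameters(), lr=lr)
    return torch.optim.AdamW(model.parameters(), lr=lr)
```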
### Removed
- Previously, generative models had their maximum sequence length altered by
subtracting their padding token ID. This is no longer needed and has been removed.
### Fixed
- Timeouts when fetching models from the Hugging Face Hub are now handled better.
Instead of raising the error and cancelling the benchmarking process, it now retries
until the connection is back up (see the retry sketch after this list).
- Some models output both logits and hidden states, which caused unnecessary
out-of-memory issues. This is now handled using the `preprocess_logits_for_metrics`
argument in `Trainer` (see the sketch after this list).
- Now catches errors while loading model configurations.
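
The retry behaviour can be sketched like this; the exceptions caught and the fixed
sleep interval are assumptions, and `fetch_fn` stands in for whichever Hub call is
being retried:

```python
import time

import requests

def fetch_with_retries(fetch_fn, sleep_seconds: float = 5.0):
    # Keep retrying until the connection to the Hub is back up, instead
    # of aborting the whole benchmarking run on the first timeout.
    while True:
        try:
            return fetch_fn()
        except requests.exceptions.RequestException:
            time.sleep(sleep_seconds)
```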
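
A minimal sketch of the `preprocess_logits_for_metrics` fix: when a model returns
hidden states alongside its logits, only the logits are kept before `Trainer`
accumulates predictions, which avoids the out-of-memory issues:

```python
def preprocess_logits_for_metrics(logits, labels):
    # Some models return a tuple of (logits, hidden_states); keep only
    # the logits so the hidden states are never accumulated in memory.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits

# Passed to the Trainer as, e.g.:
#   Trainer(..., preprocess_logits_for_metrics=preprocess_logits_for_metrics)
```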