ScandEval

Latest version: v13.0.0

6.1.0

Added
- Added a model inference speed estimation benchmark. This can now be run by setting
either `task` or `dataset` to "speed", e.g. `scandeval -m <model_id> -d speed` or
`scandeval -m <model_id> -dt speed`. This runs 10 iterations of 100 model inferences
on a document of 2,600 characters (the string "This is a dummy document. " repeated
100 times). The measured inference speed includes tokenization and is powered by the
`pyinfer` package.
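
A rough sketch of this methodology as a plain timing loop (the model identifier is a
placeholder, and ScandEval itself uses `pyinfer` rather than the manual timing shown
here):

```python
# Rough re-creation of the speed benchmark described above: 10 iterations of
# 100 inferences (tokenization included) on a 2,600-character dummy document.
import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "xlm-roberta-base"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

document = "This is a dummy document. " * 100  # 26 characters x 100 = 2,600

speeds = []
for _ in range(10):  # 10 iterations
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(100):  # 100 inferences, tokenization included
            inputs = tokenizer(document, truncation=True, return_tensors="pt")
            model(**inputs)
    speeds.append(100 / (time.perf_counter() - start))  # inferences per second

print(f"Mean speed: {sum(speeds) / len(speeds):.1f} inferences/sec")
```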

6.0.1

Fixed
- Added prefix space to DeBERTa models.
- Now automatically increases a model's `type_vocab_size` to at least 2 when
benchmarking the model on question-answering tasks; a value of 1 previously caused an
error on these tasks.
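
A minimal sketch of this workaround using the standard `transformers` configuration
API; the model identifier is a placeholder and this is not ScandEval's exact
implementation:

```python
# Bump `type_vocab_size` to at least 2 before loading the model for question
# answering; the mismatched token-type embedding is then freshly initialised.
from transformers import AutoConfig, AutoModelForQuestionAnswering

model_id = "xlm-roberta-base"  # placeholder; its config has type_vocab_size=1
config = AutoConfig.from_pretrained(model_id)
if getattr(config, "type_vocab_size", 2) < 2:
    config.type_vocab_size = 2  # QA preprocessing uses two token type ids
model = AutoModelForQuestionAnswering.from_pretrained(
    model_id, config=config, ignore_mismatched_sizes=True
)
```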

6.0.0

Added
- Added support for decoder models such as the GPT-series.
- Added a new Swedish sentiment classification dataset, SweReC, which is not
aspect-based, unlike the previous ABSAbank-Imm dataset. The dataset is a three-way
classification task into the classical `positive`, `neutral` and `negative` classes,
thereby establishing uniformity between the sentiment classification datasets across
the different languages. It consists of reviews from both se.trustpilot.com and
reco.se, and was created by Kristoffer Svensson as part of his Bachelor's thesis
"Sentiment Analysis With Convolutional Neural Networks: Classifying sentiment in
Swedish reviews".
- Added historic BERT models from `dbmdz` as part of the default multilingual list.
- Added the `--batch-size` argument, which can be used to manually select a batch size.
The value must be one of 1, 2, 4, 8, 16 and 32.

Removed
- As SweReC is a drop-in replacement for ABSAbank-Imm, the latter has been removed from
the ScandEval benchmark.

Fixed
- Now handles DeBERTaV2 models whose configuration sets `pooler_hidden_size` to a value
different from `hidden_size`, which previously made it impossible to do sequence
classification with the model. The former is now forced to be the same as the latter,
fixing the issue (see the sketch after this list).
- Now ensures that tokenizers, model configurations and metrics are cached to the
ScandEval cache, rather than the default Hugging Face cache.
- Previously, if a model's context length was greater than 1,000 it would be reduced to
512, since an unset context length results in a very large `model_max_length` value
of the tokenizer. This conflicted with longformer-style models whose context length
_actually_ was greater than 1,000, so now this upper bound has been increased to
100,000.
- Now includes `sacremoses` as a dependency, as this is required by some tokenizers.
- Converted the `id` column in ScandiQA to a string, to avoid integer overflow errors
during preprocessing.
- If a `torch` operation does not have a deterministic implementation, a warning is now
issued instead of an error being raised.
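
A minimal sketch of the DeBERTaV2 workaround from the first item above, using the
standard `transformers` configuration API; the model identifier is illustrative and
this is not ScandEval's exact implementation:

```python
# Force `pooler_hidden_size` to equal `hidden_size` before loading a
# DeBERTaV2-style model for sequence classification.
from transformers import AutoConfig, AutoModelForSequenceClassification

model_id = "microsoft/deberta-v3-base"  # illustrative DeBERTaV2-style model
config = AutoConfig.from_pretrained(model_id, num_labels=3)
if getattr(config, "pooler_hidden_size", config.hidden_size) != config.hidden_size:
    config.pooler_hidden_size = config.hidden_size  # force the two to match
model = AutoModelForSequenceClassification.from_pretrained(model_id, config=config)
```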

5.0.0

Added
- A new argument, `ignore_duplicates` (`--ignore-duplicates/--no-ignore-duplicates` in
the CLI), skips an evaluation if the model has already been evaluated on the dataset.
This argument defaults to `True` (see the sketch after this list).
- Now stores the task and the dataset languages in the evaluation file with each
evaluation.
- Now stores model metadata in the `scandeval_benchmark_results` file. Currently, this
includes the number of trainable model parameters, the size of the model's vocabulary
and the model's maximum sequence length.
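
A minimal usage sketch of the new argument; the constructor argument is named in this
entry, but the `benchmark` call and the identifiers below are assumptions, so check the
ScandEval documentation for the exact invocation:

```python
# Assumed usage of the new `ignore_duplicates` argument; the `benchmark` call and
# the model/dataset identifiers are illustrative, not API confirmed by this entry.
from scandeval import Benchmarker

benchmarker = Benchmarker(ignore_duplicates=True)  # skip already-recorded evaluations
benchmarker.benchmark(model_id="xlm-roberta-base", dataset="scandiqa-da")  # assumed call
```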

Changed
- Evaluation results are now saved in a JSONL file instead of a JSON file, and results
are appended onto the file after every evaluation.
- You can now specify your Hugging Face authentication token in the `use_auth_token`
argument of `Benchmarker` rather than manually logging in with `huggingface-cli
login`. In the CLI an authentication token can also be supplied directly using the new
`--auth-token` argument. If a token is provided this way in the CLI, there is no need
to add the `--use-auth-token` flag (see the sketch after this list).
- The "random" models have now been renamed to "fresh", to emphasise that they are not
random, but instead randomly initialized.
- The fresh models are now task independent, meaning that `fresh-xlmr-base` will now
adapt to the task at hand, rather than having to benchmark, e.g.,
`fresh-xlmr-base-sequence-clf` and `fresh-xlmr-base-token-clf` separately.
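
A short sketch of supplying the token directly; the token value is a placeholder:

```python
# Pass a Hugging Face token directly instead of running `huggingface-cli login`;
# the token value is a placeholder.
from scandeval import Benchmarker

benchmarker = Benchmarker(use_auth_token="hf_xxxxxxxx")

# CLI equivalent (token value again a placeholder):
#   scandeval -m <model_id> -d <dataset> --auth-token hf_xxxxxxxx
```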

Fixed
- ScandEval now works on TPUs.
- Removed `bf16` precision, as it only works for some GPUs.
- Should output less `transformers` logging now.
- Models were previously loaded twice at the beginning of a benchmark. They are now
only loaded once (but re-loaded during each of the 10 iterations to ensure that every
iteration starts from the same point).
- Changed the model architecture of the `fresh-xlmr-base` from `Roberta` to
`XLMRoberta`.
- The `--dataset-task` argument now correctly filters the datasets being benchmarked.
- Some tokenizers do not add special tokens, despite having them registered. These are
now added manually, to ensure a proper evaluation of the models.
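
An illustrative version of the special-token check described in the last item; the
model identifier is a placeholder and this is not ScandEval's exact implementation:

```python
# Check whether a tokenizer actually prepends/appends its registered special
# tokens, and add them manually if not.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # placeholder model
ids = tokenizer("A test sentence").input_ids

# Fall back from CLS/SEP to BOS/EOS if the former are not registered.
bos_id = tokenizer.cls_token_id if tokenizer.cls_token_id is not None else tokenizer.bos_token_id
eos_id = tokenizer.sep_token_id if tokenizer.sep_token_id is not None else tokenizer.eos_token_id

if bos_id is not None and ids[0] != bos_id:
    ids = [bos_id] + ids  # manually prepend the missing start token
if eos_id is not None and ids[-1] != eos_id:
    ids = ids + [eos_id]  # manually append the missing end token
```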

Removed
- Removed support for evaluating finetuned models, as the package was primarily used to
benchmark pretrained models anyway, and the change in datasets means that many
finetuned models would have been trained on (part of) the test sets, resulting in
artificially large scores. For evaluation of finetuned models, please check out the
`aiai_eval` Python package instead.

4.0.2

Fixed
- Now garbage collects properly; previously (from v4 onwards) the `model` and
`model_dict` were not removed from memory after each run, potentially causing a memory
leak (see the sketch below).
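
A sketch of the kind of cleanup this fix performs; the objects are placeholders and
this is not ScandEval's exact code:

```python
# Explicitly drop references and force a collection after each run so the
# model weights are actually released.
import gc

import torch

model = ...       # stands in for the model loaded during a benchmark run
model_dict = ...  # stands in for the accompanying state

del model, model_dict        # drop the references held by the benchmark loop
gc.collect()                 # force a collection so the memory is returned
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # also release cached GPU memory
```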

Added
- Added the `HuggingFaceHubDown` and `NoInternetConnection` exceptions, to give more
information to the user when benchmarking fails.
- Added unit tests.

4.0.1

Fixed
- Removed temporary printing of scores for each iteration.
