Scandeval


0.5.0

Added
- It is now possible to evaluate only on the test sets, to save time. This can
  be done in the `Benchmark` class using the `evaluate_train` argument, and in
  the CLI with the `--evaluate_train` flag (see the sketch after this list).
- Added a `progress_bar` argument to `Benchmark` to control whether progress
  bars should be shown, and added the `no_progress_bar` flag to the CLI for
  the same purpose.
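
A minimal usage sketch of the two new options; the `scandeval` import path,
the callable interface, and the model ID are assumptions based on this entry
rather than verified API:

```python
# Sketch only: import path and call signature are assumptions.
from scandeval import Benchmark

# Skip evaluation on the training splits and hide progress bars.
benchmark = Benchmark(evaluate_train=False, progress_bar=False)
benchmark("some-org/some-model")  # placeholder model ID
```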

Changed
- Updated `epochs` and `warmup_steps` for all the datasets to more reasonable
  values, enabling better comparisons of the finetuned models.
- Changed the calculation of confidence intervals, which is now based on
  bootstrapping rather than the analytic approach. The benchmark now evaluates
  ten times on the test set and computes a bootstrap estimate of the standard
  error, which is used to compute an interval around the score on the entire
  test set.
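
The described procedure can be sketched generically as follows; this is not
ScandEval's actual implementation, and the function name, bootstrap count,
and 95% normal-approximation interval are assumptions:

```python
import numpy as np

def bootstrap_interval(run_scores, full_test_score, n_boot=1000, seed=4242):
    """Bootstrap a standard error from the ten per-run test scores and
    build an interval around the score on the entire test set."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(run_scores)
    # Resample the run scores with replacement; each row is one bootstrap sample
    boot_means = rng.choice(scores, size=(n_boot, scores.size)).mean(axis=1)
    # The spread of the bootstrap means estimates the standard error
    standard_error = boot_means.std(ddof=1)
    radius = 1.96 * standard_error  # 95% normal-approximation radius
    return full_test_score - radius, full_test_score + radius
```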

0.4.3

Fixed
- RuntimeErrors occurring during training will now raise an `InvalidBenchmark`
  exception, which means that the CLI and the `Benchmark` class will skip the
  model. This can, for instance, be caused by `max_length` not having been
  specified in the model config, meaning that the tokeniser does not know how
  much to truncate.
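
The skip behaviour presumably follows the usual wrap-and-skip pattern;
`InvalidBenchmark` is the exception named above, while the helper name and
the stub definition here are hypothetical:

```python
class InvalidBenchmark(Exception):
    """Stub standing in for the exception named above."""

def benchmark_model(model_id, train_and_evaluate):
    """Hypothetical wrapper: convert RuntimeErrors raised during
    training into InvalidBenchmark so callers can skip the model."""
    try:
        return train_and_evaluate(model_id)
    except RuntimeError as exc:
        # E.g. `max_length` missing from the model config, so the
        # tokeniser does not know how much to truncate.
        raise InvalidBenchmark(str(exc)) from exc
```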

0.4.2

Fixed
- Now catching the error where tokenisation is not possible, due to the model
  having been trained on a different task than the one present in the dataset;
  e.g., if a generative model is trained on a classification task.

0.4.1

Fixed
- Now catching the error that occurs when the model's config does not align
  with the model class. When using the CLI or `Benchmark`, such models will be
  skipped.

0.4.0

Added
- Added confidence intervals for finetuned models, where there is a 95%
  likelihood that the true score lies within the interval, given infinite
  data from the same distribution. In the case of "raw" pretrained models,
  this radius is added onto the existing interval, so that both the
  uncertainty in model initialisation and the sample size of the validation
  dataset affect the size of the interval.
- Added garbage collection after each benchmark, which will (hopefully)
  prevent memory leaks when benchmarking several models.
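
The cleanup step likely resembles the standard Python/PyTorch pattern below;
a generic sketch, not ScandEval's exact code:

```python
import gc

import torch

def cleanup_after_benchmark():
    """Generic post-benchmark cleanup, assuming the caller has already
    dropped its references to the finetuned model and datasets."""
    gc.collect()                  # reclaim unreferenced Python objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release cached CUDA memory
```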

Changed
- New logo, including the Faroe Islands!
- Added the possibility of including all languages and/or tasks in the CLI
  and the `Benchmark` class.
- Added Icelandic and Faroese to default list of languages in CLI and the
`Benchmark` class.
- The default value for `task` is now all tasks, which also includes models
  that haven't been assigned any task on the HuggingFace Hub.
- If a model cannot be trained without running out of CUDA memory, even with a
batch size of 1, then the model will be skipped in `Benchmark` and the CLI.
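
A rough sketch of this halve-then-skip strategy; the helper names are
hypothetical, and the `InvalidBenchmark` stub mirrors the one in the earlier
sketch:

```python
import torch

class InvalidBenchmark(Exception):
    """Stub, as in the sketch further up."""

def train_with_oom_fallback(train_fn, batch_size=32):
    """Halve the batch size on CUDA out-of-memory errors; signal a
    skip once batch size 1 also fails. Hypothetical helper."""
    while batch_size >= 1:
        try:
            return train_fn(batch_size)
        except RuntimeError as exc:
            if "out of memory" not in str(exc).lower():
                raise
            torch.cuda.empty_cache()
            batch_size //= 2
    raise InvalidBenchmark("CUDA out of memory even at batch size 1")
```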

Fixed
- A new model is initialised if CUDA runs out of memory, to ensure that we
  are not continuing to train the previous model.
- Dependency parsing is now implemented properly as two-label classification,
  with associated UAS and LAS metric computations. This works for pretrained
  SpaCy models as well as for finetuning general language models.
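
For reference, UAS and LAS can be computed from per-token (head, relation)
predictions as below; a generic sketch of the standard metrics, not
ScandEval's implementation:

```python
def uas_las(gold, pred):
    """Unlabelled and labelled attachment scores. `gold` and `pred`
    are equal-length lists of (head_index, relation_label) per token."""
    n = len(gold)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head only
    correct_arcs = sum(g == p for g, p in zip(gold, pred))         # head + label
    return correct_heads / n, correct_arcs / n

# Example: three tokens, one mislabelled relation
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
print(uas_las(gold, pred))  # (1.0, 0.6666...)
```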

0.3.1

Fixed
- Reduces batch size if CUDA runs out of memory during evaluation.
- Loading of text classification datasets now working properly.
