Scandeval

0.9.0

Added
- Added the separate `nb` (Norwegian Bokmål) and `nn` (Norwegian Nynorsk)
language tags, on top of the general `no` (Norwegian).
- Added more multilingual models.

Fixed
- SpaCy models were evaluated incorrectly on the `dane-no-misc` dataset, as
their `MISC` predictions were not replaced with `O` tags.
- When evaluating models finetuned for token classification on a text
classification task, a `ValueError` was raised, rather than an
`InvalidBenchmark` exception.
- An `InvalidBenchmark` exception is now raised if none of the model's labels
are among the dataset's labels, or synonyms of them. This prevents cases such
as evaluating a finetuned sentiment model on a NER task.
- When `evaluate_train` was `True`, the test set was previously evaluated
instead of the training set.

Changed
- Changed the `Benchmark` API. The constructor and the `__call__` method now
take the same arguments, except that `model_id` and `dataset` only appear in
`__call__`; the constructor sets the default values, and the `__call__` method
can override them for specific runs (see the sketch after this list).
- Changed the benchmarking order. All datasets are now benchmarked for one
model before moving on to the next model.
- Renamed the `multilabel` argument to the more descriptive `two_labels`.
- Updated docstrings to be more accurate.
- Early stopping patience is now set to `2 + 250 // len(train)`, so that
smaller datasets enjoy a bit more patience, while datasets with more than 250
samples keep the current patience of 2.
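
A minimal sketch of the new API, assuming the behaviour described above;
`evaluate_train` appears elsewhere in this changelog, while `verbose` and the
example model ID are purely illustrative:

```python
from scandeval import Benchmark

# The constructor sets defaults shared by every run.
benchmark = Benchmark(evaluate_train=False, verbose=True)

# `model_id` and `dataset` are only given at call time, and any constructor
# default can be overridden for this specific run.
results = benchmark(
    model_id="Maltehb/danish-bert-botxo",  # example model ID
    dataset="dane",
    evaluate_train=True,  # overrides the constructor default
)
```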

Removed
- Removed `learning_rate`, `batch_size`, `warmup_steps` and `num_finetunings`
arguments from the benchmarks. These are now fixed to 2e-5, 32, 25% of the
training dataset and 10, respectively. Note that the batch size will still
automatically decrease if the GPU runs out of memory.
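
To make the fixed values concrete, here is a rough sketch using Hugging Face
`TrainingArguments`; whether ScandEval builds its arguments exactly like this
is an assumption, and `train_size` is a placeholder:

```python
from transformers import TrainingArguments

train_size = 1_000    # placeholder for len(train)
num_finetunings = 10  # fixed number of finetuning runs, per the entry above

training_args = TrainingArguments(
    output_dir="finetuned",
    learning_rate=2e-5,                   # fixed learning rate
    per_device_train_batch_size=32,       # fixed batch size (lowered on OOM)
    warmup_steps=int(0.25 * train_size),  # 25% of the training dataset
)

# Early stopping patience from the "Changed" entry above.
patience = 2 + 250 // train_size
```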

0.8.0

Changed
- Models are now being trained for much longer, but with an early stopping
callback with a patience of 2. This enables a more uniform comparison between
models that require different numbers of finetuning epochs.
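
A hedged sketch of such a setup with the Hugging Face `Trainer` machinery;
only the patience of 2 comes from the entry above, while the epoch count and
the remaining arguments are illustrative assumptions:

```python
from transformers import EarlyStoppingCallback, TrainingArguments

# Evaluate once per epoch, keep the best checkpoint, and stop once the
# validation metric has failed to improve for 2 consecutive evaluations.
training_args = TrainingArguments(
    output_dir="finetuned",
    num_train_epochs=100,         # "much longer" upper bound (illustrative)
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
# `early_stopping` would then be passed to a Trainer via `callbacks=[early_stopping]`.
```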

Fixed
- There was a bug when evaluating a finetuned PyTorch model on a sequence
classification task, if the model had only been trained on a proper subset of
the labels present in the dataset.

Removed
- All individual benchmarks have been removed from `__init__.py`. They can
still be imported using their individual modules, for instance
`from scandeval.dane import DaneBenchmark`, but the idea is to use the
general `Benchmark` class instead.
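
Both import paths side by side; the `DaneBenchmark` import comes from the
entry above, while the `Benchmark` usage is a hedged sketch of the intended
replacement (its call signature is sketched in the 0.9.0 notes above):

```python
# Still possible, but discouraged: import an individual benchmark directly.
from scandeval.dane import DaneBenchmark

# Preferred: use the general Benchmark class, with the model and dataset
# passed at call time.
from scandeval import Benchmark

benchmark = Benchmark()
```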

0.7.0

Changed
- Always ensure that a model can deal with the labels in the dataset when
finetuning. If the model has not been trained on a label, it will always get
that label wrong. For instance, this is the case for finetuned NER models that
have not been trained on `MISC` tags when they are evaluated on the DaNE
dataset.
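
A hedged sketch of one way such a check could look, comparing a model's label
set with the dataset's; the helper and the label sets are hypothetical and not
ScandEval's actual implementation:

```python
def missing_labels(model_labels: set, dataset_labels: set) -> set:
    """Return the dataset labels the model has never been trained on."""
    missing = dataset_labels - model_labels
    if missing == dataset_labels:
        # No overlap at all, e.g. a sentiment model evaluated on NER; the
        # 0.9.0 notes above say ScandEval raises InvalidBenchmark here.
        raise ValueError("None of the model's labels match the dataset's labels.")
    return missing

# A NER model trained without MISC, evaluated against DaNE-style labels.
print(missing_labels({"O", "PER", "LOC", "ORG"},
                     {"O", "PER", "LOC", "ORG", "MISC"}))  # {'MISC'}
```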

Fixed
- Fixed a bug when evaluating SpaCy models.
- Objects are now only removed during memory cleanup if they exist at all.

0.6.0

Added
- When finetuning models, 10% of the training data is held out to evaluate the
models and to choose the best-performing model across all trained epochs. This
allows for a fairer comparison, as some models degrade over time while others
need longer to train.
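
A rough sketch of such a split with the `datasets` library; the 90/10 ratio
comes from the entry above, while the toy data and seed are illustrative:

```python
from datasets import Dataset

# Toy training data standing in for a real benchmark dataset.
train = Dataset.from_dict({
    "text": [f"example {i}" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})

# Hold out 10% of the training data for per-epoch evaluation; the best
# checkpoint across all epochs is then chosen on this split.
splits = train.train_test_split(test_size=0.1, seed=4242)
train_split, val_split = splits["train"], splits["test"]
print(len(train_split), len(val_split))  # 90 10
```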

Changed
- Uniformised the `_log_metrics` method for all benchmarks, now only defined in
`BaseBenchmark`.

Fixed
- Garbage collection is now performed when downsizing the batch size, so that
all the previous models are not kept in memory (see the sketch after this
list).
- Typos in logging.
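
A minimal sketch of that kind of cleanup, assuming a PyTorch setup; `model`
and `trainer` are placeholders for whatever was created before the
out-of-memory error, and the surrounding retry loop is not shown:

```python
import gc

import torch

# Drop references to the failed model and trainer before retrying with a
# smaller batch size, so the old objects do not stay in memory alongside
# the new ones.
try:
    del model, trainer  # only removed if they exist at all
except NameError:
    pass
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached GPU memory to the driver
```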

0.5.2

Fixed
- Fixed a bug when `evaluate_train` was set to `False`.

0.5.1

Fixed
- The bootstrapping of the datasets is now done properly. Previously the
bootstrapped datasets were not converted to HuggingFace Dataset objects.
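
A small sketch of bootstrapping a split while keeping it a HuggingFace
`Dataset`; the resampling is ordinary sampling with replacement, and the toy
data is illustrative:

```python
import numpy as np
from datasets import Dataset

test = Dataset.from_dict({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})

rng = np.random.default_rng(4242)
indices = rng.integers(0, len(test), size=len(test))  # sample with replacement

# `select` returns a new Dataset, so the bootstrapped sample stays a proper
# HuggingFace Dataset object rather than a plain list of rows.
bootstrapped = test.select(indices)
print(type(bootstrapped).__name__)  # Dataset
```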
