### Added
- Added the separate `nb` (Norwegian Bokmål) and `nn` (Norwegian Nynorsk)
  language tags, in addition to the general `no` (Norwegian) tag.
- Added more multilingual models.
### Fixed
- SpaCy models were evaluated incorrectly on the `dane-no-misc` dataset, as
  their `MISC` predictions were not replaced with `O` tags.
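The fix amounts to mapping `MISC` predictions to the outside tag before
scoring. A minimal sketch, assuming BIO-style tags; `strip_misc_tags` is a
hypothetical helper, not the library's actual function:

```python
def strip_misc_tags(predictions: list) -> list:
    """Replace MISC predictions with the outside tag `O`, since the
    dane-no-misc dataset contains no MISC entities.

    Hypothetical helper for illustration; tag names are standard BIO
    tags, not the library's internals.
    """
    return ["O" if tag in ("B-MISC", "I-MISC") else tag for tag in predictions]
```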
- When evaluating models finetuned for token classification on a text
classification task, a `ValueError` was raised, rather than an
`InvalidBenchmark` exception.
- An `InvalidBenchmark` exception is now raised if none of the model's labels
  are among the dataset's labels, nor synonyms of them. This prevents things
  like evaluating a finetuned sentiment model on a NER task.
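The check can be sketched as below; `check_label_overlap` and its synonym
mapping are illustrative assumptions, not the library's actual implementation:

```python
class InvalidBenchmark(Exception):
    """Raised when a model cannot meaningfully be benchmarked on a dataset."""


def check_label_overlap(model_labels, dataset_labels, synonyms=None):
    """Raise InvalidBenchmark if the model's labels share nothing with the
    dataset's labels, even after mapping known synonyms.

    Hypothetical sketch: the real synonym handling is not shown here.
    """
    synonyms = synonyms or {}
    mapped = {synonyms.get(label, label) for label in model_labels}
    if mapped.isdisjoint(set(dataset_labels)):
        raise InvalidBenchmark(
            "None of the model's labels match the dataset's labels."
        )
```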
- When `evaluate_train` was `True`, the test set was previously evaluated
  instead of the training set.
### Changed
- Changed the `Benchmark` API. The constructor and the `__call__` method now
  take the same arguments, except for `model_id` and `dataset`, which only
  appear in `__call__`. The constructor sets default values, which `__call__`
  can override for a specific run.
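A hypothetical sketch of that pattern; the argument names other than
`model_id` and `dataset` are invented for illustration and need not match the
library's real parameters:

```python
from typing import Optional


class Benchmark:
    """Constructor sets defaults; __call__ may override them per run."""

    def __init__(self, language: str = "no", evaluate_train: bool = False):
        self.language = language
        self.evaluate_train = evaluate_train

    def __call__(
        self,
        model_id: str,
        dataset: Optional[str] = None,
        language: Optional[str] = None,
        evaluate_train: Optional[bool] = None,
    ) -> dict:
        # Fall back to the defaults stored on the instance when an
        # argument is not given for this particular run.
        return {
            "model_id": model_id,
            "dataset": dataset,
            "language": self.language if language is None else language,
            "evaluate_train": (
                self.evaluate_train if evaluate_train is None else evaluate_train
            ),
        }
```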
- Changed the benchmarking order. All datasets are now benchmarked for a model
  before moving on to the next model.
- Renamed the `multilabel` argument to the more descriptive `two_labels`.
- Updated docstrings to be more accurate.
- Early stopping patience is now set to `2 + 250 // len(train)`, so that
  smaller datasets get a bit more patience, while datasets with more than 250
  samples keep the previous patience of 2.
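The patience rule in code form; `num_train_samples` stands in for `len(train)`:

```python
def early_stopping_patience(num_train_samples: int) -> int:
    """Compute early stopping patience as 2 + 250 // num_train_samples.

    Small training sets get extra patience; once the training set is large
    enough, the integer division contributes nothing and the patience
    settles at 2.
    """
    return 2 + 250 // num_train_samples
```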
### Removed
- Removed `learning_rate`, `batch_size`, `warmup_steps` and `num_finetunings`
arguments from the benchmarks. These are now fixed to 2e-5, 32, 25% of the
training dataset and 10, respectively. Note that the batch size will still
automatically decrease if the GPU runs out of memory.