Added
- Added arguments to `Benchmarker.benchmark` (or simply `Benchmarker.__call__`),
corresponding to the same arguments used during initialisation. The idea is that the
default parameters are set during initialisation, and any of them can then be
overridden when performing a concrete evaluation, without having to re-initialise
the `Benchmarker`. A usage sketch follows this list.
- Added the Danish knowledge datasets `danske-talemaader` and `danish-citizen-tests`.
Both are multiple choice datasets: the first tests knowledge of Danish idioms, and
the second tests knowledge of Danish society. These replace the machine translated
MMLU-da dataset.
- Added a `--num-iterations` flag (`num_iterations` in the Python API), which controls
the number of times each model is evaluated, defaulting to the usual 10 iterations.
This is only meant to be changed by power users; if it is changed, the resulting
scores will not be included in the leaderboards.
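
Below is a minimal sketch of how these additions could be used together. The parameter
names (`language`, `num_iterations`, `model`, `dataset`) and the model identifier are
assumptions based on the entries above, not a confirmed signature; consult the
documentation for the exact API.

```python
from scandeval import Benchmarker

# Defaults are set once at initialisation. The parameter names here are
# assumptions based on this changelog, not a confirmed signature.
benchmarker = Benchmarker(language="da", num_iterations=10)

# Any default can then be overridden for a single evaluation, without
# re-initialising the Benchmarker. Here a hypothetical model is evaluated
# on the two new Danish knowledge datasets with only 3 iterations (note
# that such scores are not included in the leaderboards).
benchmarker.benchmark(
    model="some-org/some-model",
    dataset=["danske-talemaader", "danish-citizen-tests"],
    num_iterations=3,
)
```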
Changed
- The default language selection is now all languages, rather than only Danish,
Swedish and Norwegian. See the sketch after this list for restoring the previous
behaviour.
- Changed all summarisation datasets to use a single few-shot example (some were
previously set to 2), and increased the maximum number of generated tokens from 128
to 256, since many of the gold standard summaries are around 200 tokens long.
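
As a sketch of the first change above: to restore the previous behaviour of only
benchmarking the Scandinavian languages, they can presumably be passed explicitly.
That the `language` parameter accepts a list of language codes is an assumption here.

```python
from scandeval import Benchmarker

# All languages are now benchmarked by default; pass language codes
# explicitly to restrict the benchmark to the previous default of
# Danish, Swedish and Norwegian (assumed ISO 639-1 codes).
benchmarker = Benchmarker(language=["da", "sv", "no"])
```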
Fixed
- Fixed an error that occurred when an old version of the `openai` package was
installed and the `scandeval` package checked whether a model exists as an OpenAI
model. An informative error is now thrown if the model is not found on any of the
available platforms, and it also notes which missing extras prevented the package
from checking for the model's existence on the remaining platforms.
- Changed the prompt for the English sentiment classification dataset SST5, which
previously stated that the documents were tweets; they are now referred to as
"texts".
- Fixed the check for whether the `openai` extra should be used; the faulty check
made it impossible to benchmark OpenAI models.
- Disabled `lmformatenforcer` logging, which occurs in the rare case where a model is
few-shot evaluated on NER and there are no JSON-valid tokens to generate. A sketch
of the general silencing technique follows this list.
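
The general technique for silencing a third-party library's log output in Python looks
like the following. This is a sketch of the standard approach, not necessarily how
`scandeval` implements it internally, and it assumes the library logs under its module
name, which is the usual convention.

```python
import logging

# Raise the log level of the `lmformatenforcer` logger so that its
# records below CRITICAL are suppressed.
logging.getLogger("lmformatenforcer").setLevel(logging.CRITICAL)
```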
Removed
- Removed all machine translated ARC datasets, as they had a near-100% correlation
with the machine translated versions of the MMLU datasets.