Added
- Now caches the completions of open source generative models, which makes the
benchmarking of these ~33% faster. We cannot store all logits for storage reasons
(that quickly exceeds 100GB), so we instead store the top-100 logits for each
generated token, and only if the generated sequence is shorter than 50 tokens. We
thus assume that (a) these are the only logits needed, and (b) that the generations
don't change. Assumption (a) holds since we only use the logits in classification
tasks, where only the first token is used anyway. As for (b), since we evaluate with
a temperature of 0, the generations are as close to deterministic as possible (up to
small rounding fluctuations of the logits, which are negligible). See the sketch
after this list. This is a breaking change, since it is not compatible with the
previous way we cached OpenAI model outputs.
- Added a new `--clear-model-cache` flag, which removes the cached models after
finishing the benchmarking of each model, to save disk space. This doesn't remove the
cached model outputs or datasets.
- Added the following new datasets:
- `fone`, a Faroese NER dataset, which replaces the previous `wikiann-fo` dataset.
- `dansk`, a Danish NER dataset, which replaces the previous `dane` dataset.
- `norquad`, a Norwegian question answering dataset, which replaces the previous
`scandiqa-no` dataset.
- Danish, Swedish, German and Dutch versions of the MMLU, ARC and HellaSwag
datasets, testing the knowledge and common sense reasoning of generative models.
These have been machine translated by the University of Oregon using
GPT-3.5-turbo. Machine translation is of course not ideal, so consider these a
first version of this kind of evaluation, added to get some benchmarks going as
soon as possible.
- `squad-nl`, a Dutch extractive question answering dataset, which is a machine
translated version of SQuAD-v2. As with the datasets mentioned above, this is
meant as a first version of a Dutch QA dataset, until a better one becomes
available.
- Added the `--only-validation-split` flag, which benchmarks the model on the
validation split only. This split is 5-10x smaller than the test split (depending
on the dataset), which makes the flag especially useful with paid models like the
OpenAI models. The value of the flag is stored in the benchmark results, so it will
be visible on leaderboards.
- Now uses vLLM as the underlying engine when few-shot evaluating generative models,
which drastically improves the evaluation speed and requires less GPU memory. See
the sketch after this list.
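
To illustrate the completion caching above: a rough sketch of the per-token logit
truncation, where the function name and constants are illustrative rather than the
package's actual API.

```python
import torch

MAX_CACHED_SEQUENCE_LENGTH = 50  # only cache generations shorter than 50 tokens
NUM_CACHED_LOGITS = 100  # keep the top-100 logits per generated token


def truncate_logits(scores: list[torch.Tensor]) -> list[dict[int, float]] | None:
    """Keep only the top-100 logits for each generated token.

    `scores` contains one logit tensor (of shape [1, vocab_size]) per generated
    token, as returned by `model.generate(..., output_scores=True)`. Returns None
    if the generation is too long to cache.
    """
    if len(scores) >= MAX_CACHED_SEQUENCE_LENGTH:
        return None
    cached = []
    for token_scores in scores:
        top_scores, top_ids = token_scores.squeeze(0).topk(NUM_CACHED_LOGITS)
        cached.append(dict(zip(top_ids.tolist(), top_scores.tolist())))
    return cached
```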
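
The switch to vLLM for few-shot evaluation roughly amounts to the following minimal
sketch; the model ID is just an example, and the actual integration also handles
prompt construction, stopping criteria and caching.

```python
from vllm import LLM, SamplingParams

# Deterministic generation, matching the temperature-0 setup used in the benchmark.
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

llm = LLM(model="mistralai/Mistral-7B-v0.1")  # any generative Hugging Face Hub model
outputs = llm.generate(["Question: ...\nAnswer:"], sampling_params)
print(outputs[0].outputs[0].text)
```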
Changed
- Now compatible with `transformers >= 4.36.2`, and this version is now required, as
that release changed the generation API in a breaking manner.
- Now removes all newlines from texts in the summarization task, where previously these
were merely "squashed" to single newlines. This makes the separation of few-shot
examples for generative models easier.
- Also removes newlines from the NER task, where these were not removed at all
previously.
- Now doesn't force ASCII characters in the NER task for generative models, making the
target JSON dictionary more consistent with the input text.
- If a model is stored in the Safetensors format on the Hugging Face Hub, then we now
read the number of parameters directly from those files. This results in more
accurate parameter counts than loading the model in 4-bit and counting the
parameters manually. See the sketch after this list.
- Samples with excessively short or long texts have been removed.
- Adjusted number of few-shot examples in datasets to ensure that the resulting prompt
is at most ~3000 tokens long.
- If a timeout error occurs while loading a model, we now retry at most 5 times,
where previously we would keep retrying indefinitely. See the sketch after this
list.
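
A sketch of the Safetensors-based parameter counting, assuming the metadata endpoint
exposed by `huggingface_hub`; the actual implementation may differ.

```python
from huggingface_hub import HfApi


def num_parameters_from_safetensors(model_id: str) -> int:
    """Read the parameter count from the Safetensors metadata on the Hugging Face Hub.

    The metadata lists the parameter count per dtype, so we sum over all dtypes.
    """
    metadata = HfApi().get_safetensors_metadata(model_id)
    return sum(metadata.parameter_count.values())
```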
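
The bounded retry on timeouts corresponds roughly to this sketch; the exact exception
types and the backoff are illustrative.

```python
import time

from requests.exceptions import RequestException
from transformers import AutoModel

MAX_ATTEMPTS = 5  # previously we would retry indefinitely


def load_model_with_retries(model_id: str):
    """Attempt to load a model, retrying at most five times on timeouts."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return AutoModel.from_pretrained(model_id)
        except (TimeoutError, RequestException):
            if attempt == MAX_ATTEMPTS - 1:
                raise
            time.sleep(2**attempt)  # simple backoff between attempts
```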
Fixed
- Removed `text2text-generation` temporarily from the tags defining generative models,
since we do not support the benchmarking of these yet. This will be added back in as
soon as we support them.
- Now catches `OSError`s when loading Hugging Face model configurations, which happen
when there is no `config.json` file in the model repo.
- When sampling few-shot examples for question answering tasks, we previously only
sampled among examples with a context shorter than 1024 characters, to keep the
prompt short. This is too small for some datasets, so we now set this threshold
dynamically per dataset, starting at 512 characters and doubling until at least the
desired number of few-shot examples is available. See the sketch after this list.
- Now only sets `torch_dtype` if CUDA is available, as it otherwise causes errors.
- Previously, text generation in a batch was stopped as soon as any sample in the
batch reached the stopping criteria, causing a lot of incomplete completions. The
model now continues generating until the entire batch is complete, and the excess
generation is removed afterwards. See the sketch after this list.
- When benchmarking encoder models on QA tasks, the contexts are split up if they
exceed the model's context length. The stride value used caused errors in rare cases
where the model's maximum context length was very small (128). This has now been
fixed.
- Now sets `ignore_mismatched_sizes` when loading models if the model cannot be loaded
otherwise. This previously caused some issues when loading certain models.
- Fixed a bug where some encoder models did not work properly when loaded with FP16
mixed precision, due to overflow. We now load models in BF16, as it has a larger
range, but fall back to FP16 if BF16 is not available. If both lead to overflow, we
attempt again with full FP32, and lastly throw an informative error and block the
evaluation if the overflow persists. See the sketch after this list.
- When few-shot evaluating models on NER tasks, we are now more lenient towards the
generated model output. Instead of taking the output as-is, we now extract the first
dictionary (enclosed in curly brackets) and replace all single quotes (') with
double quotes ("). See the sketch after this list.
- If a model is already pre-quantized then we will not attempt to quantize it as well.
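
A sketch of the dynamic context-length threshold used when sampling few-shot QA
examples; variable names are illustrative.

```python
def context_length_threshold(contexts: list[str], num_few_shots: int) -> int:
    """Find the smallest threshold that leaves enough short-context examples.

    Starts at 512 characters and doubles until at least `num_few_shots` examples
    have a context below the threshold. Assumes the dataset contains at least
    `num_few_shots` examples in total.
    """
    threshold = 512
    while sum(len(context) < threshold for context in contexts) < num_few_shots:
        threshold *= 2
    return threshold
```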
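
The fix to batched generation amounts to letting every sample finish and trimming
the excess afterwards, roughly as below; the stop string is just an example.

```python
def trim_completions(completions: list[str], stop_string: str = "\n\n") -> list[str]:
    """Cut each completion at its own stopping point.

    Generation now continues until every sample in the batch is finished, so any
    text produced after a sample's individual stop string is removed here.
    """
    return [completion.split(stop_string)[0] for completion in completions]
```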
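
A sketch of the precision fallback; `logits_overflow` is a hypothetical stand-in for
the package's actual overflow check.

```python
import torch
from transformers import AutoModel


def load_with_dtype_fallback(model_id: str):
    """Prefer BF16 (larger range), fall back to FP16, and lastly to full FP32."""
    bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    first_choice = torch.bfloat16 if bf16_ok else torch.float16
    for dtype in (first_choice, torch.float32):
        model = AutoModel.from_pretrained(model_id, torch_dtype=dtype)
        if not logits_overflow(model):  # hypothetical overflow check
            return model
    raise ValueError(
        f"The outputs of {model_id} overflow even in full FP32 precision, so the "
        "model cannot be evaluated."
    )
```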
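
The lenient parsing of generated NER output is essentially the following sketch.

```python
import json
import re


def parse_ner_output(generated_text: str) -> dict:
    """Extract the first {...} block and repair single quotes before parsing."""
    match = re.search(r"\{.*?\}", generated_text, flags=re.DOTALL)
    if match is None:
        return {}
    return json.loads(match.group().replace("'", '"'))
```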