Fixed
- Changed the vLLM inference parameters to reduce GPU memory overhead during
  evaluation, making it possible to evaluate larger models on the same hardware as
  before. Concretely, `gpu_memory_utilization` has been raised from 0.9 to 0.95,
  `enforce_eager` is set to `True`, and `max_model_len` has been reduced from (at
  most) 10,000 to (at most) 5,000. See [this
  issue](https://github.com/ScandEval/ScandEval/issues/383) for an overview of the
  maximum number of tokens in each dataset (as of v12.6.0 of ScandEval).
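As a rough sketch (not ScandEval's actual code), the new settings correspond to keyword arguments of the `vllm.LLM` constructor; the model ID below is a placeholder:

```python
# The vLLM parameters described above, collected as keyword arguments.
vllm_kwargs = dict(
    gpu_memory_utilization=0.95,  # raised from 0.9: let vLLM use more of the GPU
    enforce_eager=True,           # skip CUDA graph capture, saving GPU memory
    max_model_len=5_000,          # reduced from 10,000: a much smaller KV cache
)

# Requires a GPU and the `vllm` package:
# from vllm import LLM
# llm = LLM(model="some-org/some-model", **vllm_kwargs)
```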
- Replaced an abnormally long sample in the Swedish sentiment classification dataset
  SweReC with a new one, keeping the maximum number of tokens in the samples below
  5,000.
- The number of allowed generated tokens for the Danish summarisation dataset
  Nordjylland News was mistakenly set to 128, rather than the 256 used for all other
  summarisation datasets. This has now been fixed.
- Now correctly detects whether `autoawq` needs to be installed when evaluating an
  AWQ model.
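A minimal sketch of such a check, assuming the model's Hugging Face config is available as a dict (as returned by `AutoConfig.to_dict()`); the helper name is an assumption for illustration:

```python
def requires_autoawq(config: dict) -> bool:
    """Hypothetical helper: True if the model was quantised with AWQ,
    in which case the `autoawq` package is needed for evaluation."""
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method") == "awq"


# Example: an AWQ-quantised model advertises its method in the config.
requires_autoawq({"quantization_config": {"quant_method": "awq"}})  # True
requires_autoawq({})  # False: no quantisation config at all
```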
- Reduced `transformers` dependency to `4.38.x` again, as `autoawq` requires this.
- No longer applies BitsAndBytes quantisation if the model is already quantised.
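The guard can be sketched as follows, again assuming a dict-shaped Hugging Face config; the function name and `load_in_4bit` flag are assumptions, not ScandEval's actual API:

```python
def should_apply_bnb(config: dict, load_in_4bit: bool) -> bool:
    """Hypothetical helper: apply BitsAndBytes quantisation only when it was
    requested and the model is not already quantised (e.g. AWQ or GPTQ)."""
    already_quantised = config.get("quantization_config") is not None
    return load_in_4bit and not already_quantised


# Example: a pre-quantised AWQ model must not be re-quantised with BitsAndBytes.
should_apply_bnb({"quantization_config": {"quant_method": "awq"}}, load_in_4bit=True)  # False
should_apply_bnb({}, load_in_4bit=True)  # True
```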