Fixed
- Now correctly recognises the metadata for the new GPT-4o models. There is
  currently a version clash between `vllm` and `tiktoken`, meaning that `tiktoken`
  needs to be manually upgraded to evaluate GPT-4o; an informative error message now
  alerts the user to this in that case.
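  Until the clash is resolved, the manual upgrade is a standard pip command
  (assuming a pip-based install):

  ```shell
  # Manually upgrade tiktoken to evaluate GPT-4o
  # (the raised error message states the exact requirement).
  pip install --upgrade tiktoken
  ```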
- The number of generated tokens for sequence classification tasks has been changed
  back to 1 (from 3). This makes no difference for open source models, as we only use
  the logprobs of the first token anyway, but it makes a big difference on multiple
  choice QA tasks for OpenAI models: some of them may output things like "a is
  correct" rather than simply "a", and since we match predictions to the labels using
  word edit distance, the extra words can accidentally change the final prediction
  away from "a".
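  The label-matching step described above can be sketched roughly as follows (a
  simplified illustration only; the function names are ours, not the library's
  actual implementation):

  ```python
  def word_edit_distance(a: list[str], b: list[str]) -> int:
      """Levenshtein distance computed over word sequences."""
      m, n = len(a), len(b)
      dp = [[0] * (n + 1) for _ in range(m + 1)]
      for i in range(m + 1):
          dp[i][0] = i
      for j in range(n + 1):
          dp[0][j] = j
      for i in range(1, m + 1):
          for j in range(1, n + 1):
              cost = 0 if a[i - 1] == b[j - 1] else 1
              dp[i][j] = min(
                  dp[i - 1][j] + 1,      # delete a word
                  dp[i][j - 1] + 1,      # insert a word
                  dp[i - 1][j - 1] + cost,  # substitute (or keep) a word
              )
      return dp[m][n]


  def closest_label(generation: str, labels: list[str]) -> str:
      """Map a generated completion to the label with the smallest distance."""
      return min(
          labels,
          key=lambda label: word_edit_distance(
              generation.lower().split(), label.lower().split()
          ),
      )


  labels = ["a", "b", "c", "d"]
  print(closest_label("a", labels))             # → a
  print(closest_label("a is correct", labels))  # → a, but only 1 word apart from "b"
  ```

  With verbose completions like "a is correct", every label ends up within a word or
  two of the output, so ties and near-ties can flip the prediction; limiting the
  generation to a single token avoids this.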
- An error in `outlines<=0.0.36` meant that NER evaluations were near-random.
  Unfortunately, due to a strict `outlines` requirement in `vllm`, we cannot enforce
  `outlines>=0.0.37` ourselves (see [this vLLM PR for a future
  fix](https://github.com/vllm-project/vllm/pull/4109)). For now, to prevent faulty
  evaluations, we raise an error asking the user to manually upgrade `outlines` if
  they have an old version.
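  If you hit this error, the manual upgrade past the buggy releases should look
  something like this (assuming a pip-based install):

  ```shell
  # Manually upgrade outlines past the buggy <=0.0.36 releases.
  pip install --upgrade 'outlines>=0.0.37'
  ```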