Lm-eval

Latest version: v0.4.8

Safety actively analyzes 723158 Python packages for vulnerabilities to keep your Python projects secure.

Page 1 of 2

0.4.7

This release includes several bug fixes, minor improvements to model handling, and task additions.

⚠️ Python 3.8 End of Support Notice
Python 3.8 support will be dropped in future releases as it has reached its end of life. Users are encouraged to upgrade to Python 3.9 or newer.

Backwards Incompatibilities

0.4.6

This release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.

Backwards Incompatibilities

Chat Template Delimiter Handling

An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.

📝 For detailed documentation, please refer to [docs/chat-template-readme.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/chat-template-readme.md)

New Benchmarks & Tasks

Multilingual Expansion
- **Spanish Bench**: Enhanced benchmark with additional tasks by zxcvuser in 2390
- **Japanese Leaderboard**: New comprehensive Japanese language benchmark by sitfoxfly in 2439

New Task Collections
- **Multimodal Unitext**: Added support for multimodal tasks available in unitext by elronbandel in 2364
- **Metabench**: New benchmark contributed by kozzy97 in 2357

As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).

Thanks, the LM Eval Harness team (baberabb and lintangsutawika)

What's Changed
* Add Unitxt Multimodality Support by elronbandel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2364
* Add new tasks to spanish_bench and fix duplicates by zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2390
* fix typo bug for minerva_math by renjie-ranger in https://github.com/EleutherAI/lm-evaluation-harness/pull/2404
* Fix: Turkish MMLU Regex Pattern by ArdaYueksel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2393
* fix storycloze datanames by t1101675 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2409
* Update NoticIA prompt by ikergarcia1996 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2421
* [Fix] Replace generic exception classes with a more specific ones by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1989
* Support for IBM watsonx_llm by Medokins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2397
* Fix package extras for watsonx support by kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2426
* Fix lora requests when dp with vllm by ckgresla in https://github.com/EleutherAI/lm-evaluation-harness/pull/2433
* Add xquad task by zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2435
* Add verify_certificate argument to local-completion by sjmonson in https://github.com/EleutherAI/lm-evaluation-harness/pull/2440
* Add GPTQModel support for evaluating GPTQ models by Qubitium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2217
* Add missing task links by Sypherd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2449
* Update CODEOWNERS by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2453
* Add real process_docs example by Sypherd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2456
* Modify label errors in catcola and paws-x by zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2434
* Add Japanese Leaderboard by sitfoxfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/2439
* Typos: Fix 'loglikelihood' misspellings in api_models.py by RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/2459
* use global `multi_choice_filter` for mmlu_flan by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2461
* typo by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2465
* pass device_map other than auto for parallelize by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2457
* OpenAI ChatCompletions: switch `max_tokens` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2443
* Ifeval: Dowload `punkt_tab` on rank 0 by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2267
* Fix chat template; fix leaderboard math by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2475
* change warning to debug by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2481
* Updated wandb logger to use `new_printer()` instead of `get_printer(...)` by alex-titterton in https://github.com/EleutherAI/lm-evaluation-harness/pull/2484
* IBM watsonx_llm fixes & refactor by Medokins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2464
* Fix revision parameter to vllm get_tokenizer by OyvindTafjord in https://github.com/EleutherAI/lm-evaluation-harness/pull/2492
* update pre-commit hooks and git actions by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2497
* kbl-v0.1.1 by whwang299 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2493
* Add mamba hf to `mamba_ssm` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2496
* remove duplicate `arc_ca` tag by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2499
* Add metabench task to LM Evaluation Harness by kozzy97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2357
* Nits by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2500
* [API models] parse tokenizer_backend=None properly by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2509

New Contributors
* renjie-ranger made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2404
* t1101675 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2409
* Medokins made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2397
* kiersten-stokes made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2426
* ckgresla made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2433
* sjmonson made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2440
* Qubitium made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2217
* Sypherd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2449
* sitfoxfly made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2439
* RobGeada made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2459
* alex-titterton made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2484
* OyvindTafjord made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2492
* whwang299 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2493
* kozzy97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2357

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.5...v0.4.6

0.4.5

New Additions

Prototype Support for Vision Language Models (VLMs)

We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using model types `hf-multimodal` and `vllm-vlm`. This allows for evaluation of models that can process text and image inputs and produce text outputs. Currently we have added support for the MMMU (`mmmu_val`) task and we welcome contributions and feedback from the community!

New VLM-Specific Arguments

VLM models can be configured with several new arguments within `--model_args` to support their specific requirements:

- `max_images` (int): Set the maximum number of images for each prompt.
- `interleave` (bool): Determines the positioning of image inputs. When `True` (default) images are interleaved with the text. When `False` all images are placed at the front of the text. This is model dependent.

`hf-multimodal` specific args:
- `image_token_id` (int) or `image_string` (str): Specifies a custom token or string for image placeholders. For example, Llava models expect an `"<image>"` string to indicate the location of images in the input, while Qwen2-VL models expect an `"<|image_pad|>"` sentinel string instead. This will be inferred based on model configuration files whenever possible, but we recommend confirming that an override is needed when testing a new model family
- `convert_img_format` (bool): Whether to convert the images to RGB format.

Example usage:

- `lm_eval --model hf-multimodal --model_args pretrained=llava-hf/llava-1.5-7b-hf,attn_implementation=flash_attention_2,max_images=1,interleave=True,image_string=<image> --tasks mmmu_val --apply_chat_template`

- `lm_eval --model vllm-vlm --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True --tasks mmmu_val --apply_chat_template`

Important considerations

1. **Chat Template**: Most VLMs require the `--apply_chat_template` flag to ensure proper input formatting according to the model's expected chat template.
2. Some VLM models are limited to processing a single image per prompt. For these models, always set `max_images=1`. Additionally, certain models expect image placeholders to be non-interleaved with the text, requiring `interleave=False`.
3. Performance and Compatibility: When working with VLMs, be mindful of potential memory constraints and processing times, especially when handling multiple images or complex tasks.

Tested VLM Models

We have currently most notably tested the implementation with the following models:

- llava-hf/llava-1.5-7b-hf
- llava-hf/llava-v1.6-mistral-7b-hf
- Qwen/Qwen2-VL-2B-Instruct
- HuggingFaceM4/idefics2 (requires the latest `transformers` from source)

New Tasks

Several new tasks have been contributed to the library for this version!

New tasks as of v0.4.5 include:
- Open Arabic LLM Leaderboard tasks, contributed by shahrzads Malikeh97 in 2232
- **MMMU (validation set), by haileyschoelkopf baberabb lintangsutawika in 2243**
- TurkishMMLU by ArdaYueksel in 2283
- PortugueseBench, SpanishBench, GalicianBench, BasqueBench, and CatalanBench aggregate multilingual tasks in 2153 2154 2155 2156 2157 by zxcvuser and others

As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).

Backwards Incompatibilities

Finalizing `group` versus `tag` split

We've now fully deprecated the use of `group` keys directly within a task's configuration file. The appropriate key to use is now solely `tag` for many cases. See the [v0.4.4 patchnotes](https://github.com/EleutherAI/lm-evaluation-harness/releases/tag/v0.4.4) for more info on migration, if you have a set of task YAMLs maintained outside the Eval Harness repository.

Handling of Causal vs. Seq2seq backend in HFLM

In HFLM, logic specific to handling inputs for Seq2seq (encoder-decoder models like T5) versus Causal (decoder-only autoregressive models, and the vast majority of current LMs) models previously hinged on a check for `self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM`. Some users may want to use causal model behavior, but set `self.AUTO_MODEL_CLASS` to a different factory class, such as `transformers.AutoModelForVision2Seq`.

As a result, those users who subclass HFLM but do not call `HFLM.__init__()` may now also need to set the `self.backend` attribute to either `"causal"` or `"seq2seq"` during initialization themselves.

While this should not affect a large majority of users, for those who subclass HFLM in potentially advanced ways, see https://github.com/EleutherAI/lm-evaluation-harness/pull/2353 for the full set of changes.

Future Plans

We intend to further expand our multimodal support to a wider set of vision-language tasks, as well as a broader set of model types, and are actively seeking user feedback!

Thanks, the LM Eval Harness team (baberabb haileyschoelkopf lintangsutawika)

What's Changed
* Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) by Malikeh97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2232
* Multimodal prototyping by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/2243
* Update README.md by SYusupov in https://github.com/EleutherAI/lm-evaluation-harness/pull/2297
* remove comma by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2315
* Update neuron backend by dacorvo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2314
* Fixed dummy model by Am1n3e in https://github.com/EleutherAI/lm-evaluation-harness/pull/2339
* Add a note for missing dependencies by eldarkurtic in https://github.com/EleutherAI/lm-evaluation-harness/pull/2336
* squad v2: load metric with `evaluate` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2351
* fix writeout script by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2350
* Treat tags in python tasks the same as yaml tasks by giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/2288
* change group to tags in task `eus_exams` task configs by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2320
* change glianorex to test split by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2332
* mmlu-pro: add newlines to task descriptions (not leaderboard) by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2334
* Added TurkishMMLU to LM Evaluation Harness by ArdaYueksel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2283
* add mmlu readme by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2282
* openai: better error messages; fix greedy matching by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2327
* fix some bugs of mmlu by eyuansu62 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2299
* Add new benchmark: Portuguese bench by zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2156
* Fix missing key in custom task loading. by giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/2304
* Add new benchmark: Spanish bench by zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2157
* Add new benchmark: Galician bench by zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2155
* Add new benchmark: Basque bench by zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2153
* Add new benchmark: Catalan bench by zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2154
* fix tests by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2380
* Hotfix! by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2383
* Solution for CSAT-QA tasks evaluation by KyujinHan in https://github.com/EleutherAI/lm-evaluation-harness/pull/2385
* LingOly - Fixing scoring bugs for smaller models by am-bean in https://github.com/EleutherAI/lm-evaluation-harness/pull/2376
* Fix float limit override by cjluo-omniml in https://github.com/EleutherAI/lm-evaluation-harness/pull/2325
* [API] tokenizer: add trust-remote-code by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2372
* HF: switch conditional checks to `self.backend` from `AUTO_MODEL_CLASS` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2353
* max_images are passed on to vllms `limit_mm_per_prompt` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2387
* Fix Llava-1.5-hf ; Update to version 0.4.5 by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2388
* Bump version to v0.4.5 by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2389

New Contributors
* Malikeh97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2232
* SYusupov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2297
* dacorvo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2314
* eldarkurtic made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2336
* giuliolovisotto made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2288
* ArdaYueksel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2283
* zxcvuser made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2156
* KyujinHan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2385
* cjluo-omniml made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2325

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.4...v0.4.5

0.4.4

New Additions

- This release includes the **Open LLM Leaderboard 2** official task implementations! These can be run by using `--tasks leaderboard`. Thank you to the HF team (clefourrier, NathanHB , KonradSzafer, lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release [here](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard).

- **API support is overhauled!** Now: support for *concurrent requests*, chat templates, tokenization, *batching* and improved customization. This makes API support both more generalizable to new providers and should dramatically speed up API model inference.
- The url can be specified by passing the `base_url` to `--model_args`, for example, `base_url=http://localhost:8000/v1/completions`; concurrent requests are controlled with the `num_concurrent` argument; tokenization is controlled with `tokenized_requests`.
- Other arguments (such as top_p, top_k, etc.) can be passed to the API using `--gen_kwargs` as usual.
- Note: Instruct-tuned models, not just base models, can be used with `local-completions` using `--apply_chat_template` (either with or without `tokenized_requests`).
- They can also be used with `local-chat-completions` (for e.g. with a OpenAI Chat API endpoint), but only the former supports loglikelihood tasks (e.g. multiple-choice). **This is because ChatCompletion style APIs generally do not provide access to logits on prompt/input tokens, preventing easy measurement of multi-token continuations' log probabilities.**
- example with OpenAI completions API (using vllm serve):
- `lm_eval --model local-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface,max_length=4096 --apply_chat_template --batch_size 1 --tasks mmlu`
- example with chat API:
- `lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10 --apply_chat_template --tasks gsm8k`
- We recommend evaluating Llama-3.1-405B models via serving them with vllm then running under `local-completions`!

- **We've reworked the Task Grouping system to make it clearer when and when not to report an aggregated average score across multiple subtasks**. See **Backwards Incompatibilities** below for more information on changes and migration instructions.

- A combination of data-parallel and model-parallel (using HF's `device_map` functionality for "naive" pipeline parallel) inference using `--model hf` is now supported, thank you to NathanHB and team!

Other new additions include a number of miscellaneous bugfixes and much more. Thank you to all contributors who helped out on this release!

New Tasks

A number of new tasks have been contributed to the library.

As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! Thank you to anthony-dipofi for working on this.

New tasks as of v0.4.4 include:
- Open LLM Leaderboard 2 tasks--see above!
- Inverse Scaling tasks, contributed by h-albert-lee in 1589
- Unitxt tasks reworked by elronbandel in 1933
- MMLU-SR, contributed by SkySuperCat in 2032
- IrokoBench, contributed by JessicaOjo IsraelAbebe in 2042
- MedConceptQA, contributed by Ofir408 in 2010
- MMLU Pro, contributed by ysjprojects in 1961
- GSM-Plus, contributed by ysjprojects in 2103
- Lingoly, contributed by am-bean in 2198
- GSM8k and Asdiv settings matching the Llama 3.1 evaluation settings, contributed by Cameron7195 in 2215 2236
- TMLU, contributed by adamlin120 in 2093
- Mela, contributed by Geralt-Targaryen in 1970

Backwards Incompatibilities

`tag`s versus `group`s, and how to migrate

Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for "parent" tasks like `mmlu` to aggregate and report a unified score across a set of component "subtasks".

There were two ways to add a task to a given `group` name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:

yaml
this is a *task* yaml file.
group: group_name1
task: my_task1
rest of task config goes here...

or 2) to define a "group config file" and specify a group along with its constituent subtasks:

yaml
this is a group's yaml file
group: group_name1
task:
- subtask_name1
- subtask_name2
...

These would both have the same effect of **reporting an averaged metric for group_name1** when calling `lm_eval --tasks group_name1`. However, in use-case 1) (simply registering a shorthand for a list of tasks one is interested in), reporting an aggregate score can be undesirable or ill-defined.

**We've now separated out these two use-cases ("shorthand" groupings and hierarchical subtask collections) into a `tag` and `group` property separately!**

To register a *shorthand* (now called a **`tag`**), simply change the `group` field name within your task's config to be `tag` (`group_alias` keys will no longer be supported in task configs.):

yaml
this is a *task* yaml file.
tag: tag_name1
task: my_task1
rest of task config goes here...

Group config files may remain as is if aggregation is not desired. **To opt-in to reporting aggregated scores across a group's subtasks, add the following to your group config file**:

yaml
this is a group's yaml file
group: group_name1
task:
- subtask_name1
- subtask_name2
...
New! Needed to turn on aggregation
aggregate_metric_list:
- metric: acc placeholder. Note that all subtasks in this group must report an `acc` metric key
- weight_by_size: True whether one wishes to report *micro* or *macro*-averaged scores across subtasks. Defaults to `True`.

Please see our documentation [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs) for more information. We apologize for any headaches this migration may create--however, we believe separating out these two functionalities will make it less likely for users to encounter confusion or errors related to mistaken undesired aggregation.

Future Plans

We're planning to make more planning documents public and standardize on (likely) 1 new PyPI release per month! Stay tuned.

Thanks, the LM Eval Harness team (haileyschoelkopf lintangsutawika baberabb)

What's Changed
* fix wandb logger module import in example by ToluClassics in https://github.com/EleutherAI/lm-evaluation-harness/pull/2041
* Fix strip whitespace filter by NathanHB in https://github.com/EleutherAI/lm-evaluation-harness/pull/2048
* Gemma-2 also needs default `add_bos_token=True` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2049
* Update `trust_remote_code` for Hellaswag by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2029
* Adds Open LLM Leaderboard Taks by NathanHB in https://github.com/EleutherAI/lm-evaluation-harness/pull/2047
* 1442 inverse scaling tasks implementation by h-albert-lee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1589
* Fix TypeError in samplers.py by converting int to str by uni2237 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2074
* Group agg rework by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1741
* Fix printout tests (N/A expected for stderrs) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2080
* Easier unitxt tasks loading and removal of unitxt library dependancy by elronbandel in https://github.com/EleutherAI/lm-evaluation-harness/pull/1933
* Allow gating EvaluationTracker HF Hub results; customizability by NathanHB in https://github.com/EleutherAI/lm-evaluation-harness/pull/2051
* Minor doc fix: leaderboard README.md missing mmlu-pro group and task by pankajarm in https://github.com/EleutherAI/lm-evaluation-harness/pull/2075
* Revert missing utf-8 encoding for logged sample files (2027) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2082
* Update utils.py by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/2085
* batch_size may be str if 'auto' is specified by meg-huggingface in https://github.com/EleutherAI/lm-evaluation-harness/pull/2084
* Prettify lm_eval --tasks list by anthony-dipofi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1929
* Suppress noisy RougeScorer logs in `truthfulqa_gen` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2090
* Update default.yaml by waneon in https://github.com/EleutherAI/lm-evaluation-harness/pull/2092
* Add new dataset MMLU-SR tasks by SkySuperCat in https://github.com/EleutherAI/lm-evaluation-harness/pull/2032
* Irokobench: Benchmark Dataset for African languages by JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2042
* docs: remove trailing sentence from contribution doc by nathan-weinberg in https://github.com/EleutherAI/lm-evaluation-harness/pull/2098
* Added MedConceptsQA Benchmark by Ofir408 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2010
* Also force BOS for `"recurrent_gemma"` and other Gemma model types by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2105
* formatting by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/2104
* docs: align local test command to match CI by nathan-weinberg in https://github.com/EleutherAI/lm-evaluation-harness/pull/2100
* Fixed colon in Belebele _default_template_yaml by jab13x in https://github.com/EleutherAI/lm-evaluation-harness/pull/2111
* Fix haerae task groups by jungwhank in https://github.com/EleutherAI/lm-evaluation-harness/pull/2112
* fix: broken discord link in CONTRIBUTING.md by nathan-weinberg in https://github.com/EleutherAI/lm-evaluation-harness/pull/2114
* docs: update truthfulqa tasks by CandiedCode in https://github.com/EleutherAI/lm-evaluation-harness/pull/2119
* Hotfix `lm_eval.caching` module by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2124
* Refactor API models by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2008
* bugfix and docs for API by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2139
* [Bugfix] add temperature=0 to logprobs and seed args to API models by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2149
* refactor: limit usage of `scipy` and `skilearn` dependencies by nathan-weinberg in https://github.com/EleutherAI/lm-evaluation-harness/pull/2097
* Update lm-eval-overview.ipynb by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2118
* fix typo. by kargaranamir in https://github.com/EleutherAI/lm-evaluation-harness/pull/2169
* Incorrect URL by zhabuye in https://github.com/EleutherAI/lm-evaluation-harness/pull/2125
* Dp and mp support by NathanHB in https://github.com/EleutherAI/lm-evaluation-harness/pull/2056
* [hotfix] API: messages were created twice by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2174
* add okapi machine translated notice. by kargaranamir in https://github.com/EleutherAI/lm-evaluation-harness/pull/2168
* IrokoBench: Fix incorrect group assignments by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2181
* Mmlu Pro by ysjprojects in https://github.com/EleutherAI/lm-evaluation-harness/pull/1961
* added gsm_plus by ysjprojects in https://github.com/EleutherAI/lm-evaluation-harness/pull/2103
* Fix `revision` kwarg dtype in edge-cases by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2184
* Small README tweaks by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2186
* gsm_plus minor fix by ysjprojects in https://github.com/EleutherAI/lm-evaluation-harness/pull/2191
* keep new line for task description by jungwhank in https://github.com/EleutherAI/lm-evaluation-harness/pull/2116
* Update README.md by ysjprojects in https://github.com/EleutherAI/lm-evaluation-harness/pull/2206
* Update citation in README.md by antonpolishko in https://github.com/EleutherAI/lm-evaluation-harness/pull/2083
* New task: Lingoly by am-bean in https://github.com/EleutherAI/lm-evaluation-harness/pull/2198
* Created a new task for gsm8k which corresponds to the Llama cot settings… by Cameron7195 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2215
* Lingoly README update by am-bean in https://github.com/EleutherAI/lm-evaluation-harness/pull/2228
* Update yaml to adapt to belebele dataset changes by Uminosachi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2216
* Add TMLU Benchmark Dataset by adamlin120 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2093
* Update IFEval dataset to official one by lewtun in https://github.com/EleutherAI/lm-evaluation-harness/pull/2218
* fix the leaderboard doc to reflect the tasks by NathanHB in https://github.com/EleutherAI/lm-evaluation-harness/pull/2219
* Add multiple chat template by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/2129
* Update CODEOWNERS by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2229
* Fix Zeno Visualizer by namtranase in https://github.com/EleutherAI/lm-evaluation-harness/pull/2227
* mela by Geralt-Targaryen in https://github.com/EleutherAI/lm-evaluation-harness/pull/1970
* fix the regex string in mmlu_pro template by lxning in https://github.com/EleutherAI/lm-evaluation-harness/pull/2238
* Fix logging when resizing embedding layer in peft mode by WPoelman in https://github.com/EleutherAI/lm-evaluation-harness/pull/2239
* fix mmlu_pro typo by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2241
* Fix typos in multiple places by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/2244
* fix group args of mmlu and mmlu_pro by eyuansu62 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2245
* Created new task for testing Llama on Asdiv by Cameron7195 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2236
* chat template hotfix by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2250
* [Draft] More descriptive `simple_evaluate()` LM TypeError by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2258
* Update NLTK version in `*ifeval` tasks ( 2210 ) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2259
* Fix `loglikelihood_rolling` caching ( 1821 ) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2187
* API: fix maxlen; vllm: prefix_token_id bug by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2262
* hotfix 2262 by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2264
* Chat Template fix (cont. 2235) by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2269
* Bump version to v0.4.4 ; Fixes to TMMLUplus by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2280

New Contributors
* ToluClassics made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2041
* NathanHB made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2048
* uni2237 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2074
* elronbandel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1933
* pankajarm made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2075
* meg-huggingface made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2084
* waneon made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2092
* SkySuperCat made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2032
* JessicaOjo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2042
* nathan-weinberg made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2098
* Ofir408 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2010
* jab13x made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2111
* jungwhank made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2112
* CandiedCode made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2119
* kargaranamir made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2169
* ysjprojects made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1961
* antonpolishko made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2083
* am-bean made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2198
* Cameron7195 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2215
* Uminosachi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2216
* adamlin120 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2093
* lewtun made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2218
* namtranase made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2227
* Geralt-Targaryen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1970
* lxning made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2238
* WPoelman made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2239
* eyuansu62 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2245

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.3...v0.4.4

0.4.3

We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future.

New Additions

The big new feature is the often-requested **Chat Templating**, contributed by KonradSzafer clefourrier NathanHB and also worked on by a number of other awesome contributors!

You can now run using a chat template with `--apply_chat_template` and a system prompt of your choosing using `--system_instruction "my sysprompt here"`. The `--fewshot_as_multiturn` flag can control whether each few-shot example in context is a new conversational turn or not.

This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs.

There's a lot more to check out, including:

- Logging results to the HF Hub if desired using `--hf_hub_log_args`, by KonradSzafer and team!

- NeMo model support by sergiopperez !
- Anthropic Chat API support by tryuman !
- DeepSparse and SparseML model types by mgoin !

- Handling of delta-weights in HF models, by KonradSzafer !
- LoRA support for VLLM, by bcicc !
- Fixes to PEFT modules which add new tokens to the embedding layers, by mapmeld !

- Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by djstrong !
- The use of custom `Sampler` subclasses in tasks, by LSinev !
- The ability to specify "hardcoded" few-shot examples more cleanly, by clefourrier !

- Support for Ascend NPUs (`--device npu`) by statelesshz, zhabuye, jiaqiw09 and others!
- Logging of `higher_is_better` in results tables for clearer understanding of eval metrics by zafstojano !

- extra info logged about models, including info about tokenizers, chat templating, and more, by artemorloff djstrong and others!

- Miscellaneous bug fixes! And many more great contributions we weren't able to list here.

New Tasks

We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the appropriate README.md within a given `lm_eval/tasks` subfolder, for further info on each task contained within a given folder. Thank you to AnthonyDipofi Harryalways317 nairbv sepiatone and others for working on this and giving feedback!

Without further ado, the tasks:
- ACLUE, a benchmark for Ancient Chinese understanding, by haonan-li
- BasqueGlue and EusExams, two Basque-language tasks by juletx
- TMMLU+, an evaluation for Traditional Chinese, contributed by ZoneTwelve
- XNLIeu, a Basque version of XNLI, by juletx
- Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by mukobi
- FDA, SWDE, and Squad-Completion zero-shot tasks by simran-arora and team
- Added back the `hendrycks_math` task, the MATH task using the prompt and answer parsing from the original Hendrycks et al. MATH paper rather than Minerva's prompt and parsing
- COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by Erland366
- tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by LucWeber and team!
- Glianorex, a benchmark for testing performance on fictional medical questions, by maximegmd
- New FLD (formal logic) task variants by MorishT
- Improved translations of Lambada Multilingual tasks, added by zafstojano
- NoticIA, a Spanish summarization dataset by ikergarcia1996
- The Paloma perplexity benchmark, added by zafstojano
- We've removed the AMMLU dataset due to concerns about auto-translation quality.
- Added the *localized*, not translated, ArabicMMLU dataset, contributed by Yazeed7 !
- BertaQA, a Basque cultural knowledge benchmark, by juletx
- New machine-translated ARC-C datasets by jonabur !
- CommonsenseQA, in a prompt format following Llama, by murphybrendan
- ...

Backwards Incompatibilities

The save format for logged results has now changed.

output files will now be written to
- `{output_path}/{sanitized_model_name}/results_YYYY-MM-DDTHH-MM-SS.xxxxx.json` if `--output_path` is set, and
- `{output_path}/{sanitized_model_name}/samples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl` for each task's samples if `--log_samples` is set.

e.g. `outputs/gpt2/results_2024-06-28T00-00-00.00001.json` and `outputs/gpt2/samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl`.

See https://github.com/EleutherAI/lm-evaluation-harness/pull/1926 for utilities which may help to work with these new filenames.

Future Plans

In general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!

- The official **Open LLM Leaderboard 2** tasks will be landing soon in the Eval Harness main branch and subsequently in `v0.4.4` on PyPI!

- The fact that `group`s of tasks by-default attempt to report an aggregated score across constituent subtasks has been a sharp edge. We are finishing up some internal reworking to distinguish between `group`s of tasks that *do* report aggregate scores (think `mmlu`) versus `tag`s which simply are a convenient shortcut to call a bunch of tasks one might want to run at once (think the `pythia` grouping which merely represents a collection of tasks one might want to gather results on each of all at once but where averaging doesn't make sense).

- We'd also like to improve the API model support in the Eval Harness from its current state.

- More to come!

Thank you to everyone who's contributed to or used the library!

Thanks, haileyschoelkopf lintangsutawika

What's Changed
* use BOS token in loglikelihood by djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1588
* Revert "Patch for Seq2Seq Model predictions" by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1601
* fix gen_kwargs arg reading by artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1607
* fix until arg processing by artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1608
* Fixes to Loglikelihood prefix token / VLLM by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1611
* Add ACLUE task by haonan-li in https://github.com/EleutherAI/lm-evaluation-harness/pull/1614
* OpenAI Completions -- fix passing of unexpected 'until' arg by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1612
* add logging of model args by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1619
* Add vLLM FAQs to README (1625) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1633
* peft Version Assertion by LameloBally in https://github.com/EleutherAI/lm-evaluation-harness/pull/1635
* Seq2seq fix by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1604
* Integration of NeMo models into LM Evaluation Harness library by sergiopperez in https://github.com/EleutherAI/lm-evaluation-harness/pull/1598
* Fix conditional import for Nemo LM class by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1641
* Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring by orsharir in https://github.com/EleutherAI/lm-evaluation-harness/pull/1647
* Add Latxa paper evaluation tasks for Basque by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1654
* Fix CLI --batch_size arg for openai-completions/local-completions by mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1656
* Patch QQP prompt (1648 ) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1661
* TMMLU+ implementation by ZoneTwelve in https://github.com/EleutherAI/lm-evaluation-harness/pull/1394
* Anthropic Chat API by tryumanshow in https://github.com/EleutherAI/lm-evaluation-harness/pull/1594
* correction bug EleutherAI1664 by nicho2 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1670
* Signpost potential bugs / unsupported ops in MPS backend by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1680
* Add delta weights model loading by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1712
* Add `neuralmagic` models for `sparseml` and `deepsparse` by mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1674
* Improvements to run NVIDIA NeMo models on LM Evaluation Harness by sergiopperez in https://github.com/EleutherAI/lm-evaluation-harness/pull/1699
* Adding retries and rate limit to toxicity tasks by sator-labs in https://github.com/EleutherAI/lm-evaluation-harness/pull/1620
* reference `--tasks list` in README by nairbv in https://github.com/EleutherAI/lm-evaluation-harness/pull/1726
* Add XNLIeu: a dataset for cross-lingual NLI in Basque by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1694
* Fix Parameter Propagation for Tasks that have `include` by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1749
* Support individual scrolls datasets by giorgossideris in https://github.com/EleutherAI/lm-evaluation-harness/pull/1740
* Add filter registry decorator by lozhn in https://github.com/EleutherAI/lm-evaluation-harness/pull/1750
* remove duplicated `num_fewshot: 0` by chujiezheng in https://github.com/EleutherAI/lm-evaluation-harness/pull/1769
* Pile 10k new task by mukobi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1758
* Fix m_arc choices by jordane95 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1760
* upload new tasks by simran-arora in https://github.com/EleutherAI/lm-evaluation-harness/pull/1728
* vllm lora support by bcicc in https://github.com/EleutherAI/lm-evaluation-harness/pull/1756
* Add option to set OpenVINO config by helena-intel in https://github.com/EleutherAI/lm-evaluation-harness/pull/1730
* evaluation tracker implementation by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1766
* eval tracker args fix by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1777
* limit fix by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1785
* remove echo parameter in OpenAI completions API by djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1779
* Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args` by MuhammadBinUsman03 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1776
* Fix bug in setting until kwarg in openai completions by ciaranby in https://github.com/EleutherAI/lm-evaluation-harness/pull/1784
* Provide ability for custom sampler for ConfigurableTask by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1616
* Update `--tasks list` option in interface documentation by sepiatone in https://github.com/EleutherAI/lm-evaluation-harness/pull/1792
* Fix Caching Tests ; Remove `pretrained=gpt2` default by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1775
* link to the example output on the hub by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1798
* Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1793
* Logging Updates (Alphabetize table printouts, fix eval tracker bug) (1774) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1791
* Initial integration of the Unitxt to LM eval harness by yoavkatz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1615
* add task for mmlu evaluation in arc multiple choice format by jonabur in https://github.com/EleutherAI/lm-evaluation-harness/pull/1745
* Update flag `--hf_hub_log_args` in interface documentation by sepiatone in https://github.com/EleutherAI/lm-evaluation-harness/pull/1806
* Copal task by Erland366 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1803
* Adding tinyBenchmarks datasets by LucWeber in https://github.com/EleutherAI/lm-evaluation-harness/pull/1545
* interface doc update by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1807
* Fix links in README guiding to another branch by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1838
* Fix: support PEFT/LoRA with added tokens by mapmeld in https://github.com/EleutherAI/lm-evaluation-harness/pull/1828
* Fix incorrect check for task type by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1865
* Fixing typos in `docs` by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1863
* Update polemo2_out.yaml by zhabuye in https://github.com/EleutherAI/lm-evaluation-harness/pull/1871
* Unpin vllm in dependencies by edgan8 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1874
* Fix outdated links to the latest links in `docs` by oneonlee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1876
* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code by statelesshz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1880
* Fix `batch_size=auto` for HF Seq2Seq models (1765) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1790
* Fix Brier Score by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1847
* Fix for bootstrap_iters = 0 case (1715) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1789
* add mmlu tasks from pile-t5 by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1710
* Bigbench fix by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1686
* Rename `lm_eval.logging -> lm_eval.loggers` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1858
* Updated vllm imports in vllm_causallms.py by mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1890
* [HFLM]Add support for Ascend NPU by statelesshz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1886
* `higher_is_better` tickers in output table by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1893
* Add dataset card when pushing to HF hub by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1898
* Making hardcoded few shots compatible with the chat template mechanism by clefourrier in https://github.com/EleutherAI/lm-evaluation-harness/pull/1895
* Try to make existing tests run little bit faster by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1905
* Fix fewshot seed only set when overriding num_fewshot by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1914
* Complete task list from pr 1727 by anthony-dipofi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1901
* Add chat template by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1873
* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data by maximegmd in https://github.com/EleutherAI/lm-evaluation-harness/pull/1867
* Modify pre-commit hook to check merge conflicts accidentally committed by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1927
* [add] fld logical formula task by MorishT in https://github.com/EleutherAI/lm-evaluation-harness/pull/1931
* Add new Lambada translations by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1897
* Implement NoticIA by ikergarcia1996 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1912
* Add The Arabic version of the PICA benchmark by khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1917
* Fix social_iqa answer choices by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1909
* Update basque-glue by zhabuye in https://github.com/EleutherAI/lm-evaluation-harness/pull/1913
* Test output table layout consistency by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1916
* Fix a tiny typo in `__main__.py` by sadra-barikbin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1939
* Add the Arabic version with refactor to Arabic pica to be in alghafa … by khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1940
* Results filenames handling fix by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1926
* Remove AMMLU Due to Translation by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1948
* Add option in TaskManager to not index library default tasks ; Tests for include_path by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1856
* Force BOS token usage in 'gemma' models for VLLM by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1857
* Fix a tiny typo in `docs/interface.md` by sadra-barikbin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1955
* Fix self.max_tokens in anthropic_llms.py by lozhn in https://github.com/EleutherAI/lm-evaluation-harness/pull/1848
* `samples` is newline delimited by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1930
* Fix `--gen_kwargs` and VLLM (`temperature` not respected) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1800
* Make `scripts.write_out` error out when no splits match by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1796
* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' by johnwee1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1956
* add trust_remote_code for piqa by changwangss in https://github.com/EleutherAI/lm-evaluation-harness/pull/1983
* Fix self assignment in neuron_optimum.py by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1990
* [New Task] Add Paloma benchmark by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1928
* Fix Paloma Template yaml by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1993
* Log `fewshot_as_multiturn` in results files by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1995
* Added ArabicMMLU by Yazeed7 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1987
* Fix Datasets `--trust_remote_code` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1998
* Add BertaQA dataset tasks by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1964
* add tokenizer logs info by artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1731
* Hotfix breaking import by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2015
* add arc_challenge_mt by jonabur in https://github.com/EleutherAI/lm-evaluation-harness/pull/1900
* Remove `LM` dependency from `build_all_requests` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2011
* Added CommonsenseQA task by murphybrendan in https://github.com/EleutherAI/lm-evaluation-harness/pull/1721
* Factor out LM-specific tests by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1859
* Update interface.md by johnwee1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1982
* Fix `trust_remote_code`-related test failures by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2024
* Fixes scrolls task bug with few_shot examples by xksteven in https://github.com/EleutherAI/lm-evaluation-harness/pull/2003
* fix cache by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2037
* Add chat template to `vllm` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2034
* Fail gracefully upon tokenizer logging failure (2035) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2038
* Bundle `exact_match` HF Evaluate metric with install, don't call evaluate.load() on import by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2045
* Update package version to v0.4.3 by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2046

New Contributors
* LameloBally made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1635
* sergiopperez made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1598
* orsharir made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1647
* ZoneTwelve made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1394
* tryumanshow made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1594
* nicho2 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1670
* KonradSzafer made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1712
* sator-labs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1620
* giorgossideris made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1740
* lozhn made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1750
* chujiezheng made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1769
* mukobi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1758
* simran-arora made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1728
* bcicc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1756
* helena-intel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1730
* MuhammadBinUsman03 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1776
* ciaranby made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1784
* sepiatone made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1792
* yoavkatz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1615
* Erland366 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1803
* LucWeber made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1545
* mapmeld made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1828
* zafstojano made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1865
* zhabuye made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1871
* edgan8 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1874
* oneonlee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1876
* statelesshz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1880
* clefourrier made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1895
* maximegmd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1867
* ikergarcia1996 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1912
* sadra-barikbin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1939
* johnwee1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1956
* changwangss made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1983
* Yazeed7 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1987
* murphybrendan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1721
* xksteven made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2003

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.2...v0.4.3

0.4.2

We are releasing a new minor version of lm-eval for PyPI users! We've been very happy to see continued usage of the lm-evaluation-harness, including as a standard testbench to propel new architecture design (https://arxiv.org/abs/2402.18668), to ease new benchmark creation (https://arxiv.org/abs/2402.11548, https://arxiv.org/abs/2402.00786, https://arxiv.org/abs/2403.01469), enabling controlled experimentation on LLM evaluation (https://arxiv.org/abs/2402.01781), and more!

New Additions
- Request Caching by inf3rnus - speedups on startup via caching the construction of documents/requests’ contexts
- Weights and Biases logging by ayulockin - evals can now be logged to both WandB and Zeno!
- New Tasks
- KMMLU, a localized - not (auto) translated! - dataset for testing Korean knowledge by h-albert-lee guijinSON
- GPQA by uanu2002
- French Bench by ManuelFay
- EQ-Bench by pbevan1 and sqrkl
- HAERAE-Bench, readded by h-albert-lee
- Updates to answer parsing on many generative tasks (GSM8k, MGSM, BBH zeroshot) by thinknbtfly!
- Okapi (translated) Open LLM Leaderboard tasks by uanu2002 and giux78
- Arabic MMLU and aEXAMS by khalil-hennara
- And more!
- Re-introduction of `TemplateLM` base class for lower-code new LM class implementations by anjor
- Run the library with metrics/scoring stage skipped via `--predict_only` by baberabb
- Many more miscellaneous improvements by a lot of great contributors!

Backwards Incompatibilities

There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:
`TaskManager` API

previously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs.

Old usage:

import lm_eval

lm_eval.tasks.initialize_tasks()
or:
lm_eval.tasks.include_path("/path/to/my/custom/tasks")

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"])

New intended usage:

import lm_eval

optional--only need to instantiate separately if you want to pass custom path!
task_manager = TaskManager() pass include_path="/path/to/my/custom/tasks" if desired

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=task_manager)

`get_task_dict()` now also optionally takes a TaskManager object, when wanting to load custom tasks.

This should allow for much faster library startup times due to lazily loading requested tasks or groups.

Updated Stderr Aggregation

Previous versions of the library incorrectly reported erroneously large `stderr` scores for groups of tasks such as MMLU.

We've since updated the formula to correctly aggregate Standard Error scores for groups of tasks reporting accuracies aggregated via their mean across the dataset -- see 1390 1427 for more information.

As always, please feel free to give us feedback or request new features! We're grateful for the community's support.

What's Changed
* Add support for RWKV models with World tokenizer by PicoCreator in https://github.com/EleutherAI/lm-evaluation-harness/pull/1374
* add bypass metric by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1156
* Expand docs, update CITATION.bib by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1227
* Hf: minor egde cases by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1380
* Enable override of printed `n-shot` in table by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1379
* Faster Task and Group Loading, Allow Recursive Groups by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1321
* Fix for https://github.com/EleutherAI/lm-evaluation-harness/issues/1383 by pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1384
* fix on --task list by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1387
* Support for Inf2 optimum class [WIP] by michaelfeil in https://github.com/EleutherAI/lm-evaluation-harness/pull/1364
* Update README.md by mycoalchen in https://github.com/EleutherAI/lm-evaluation-harness/pull/1398
* Fix confusing `write_out.py` instructions in README by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1371
* Use Pooled rather than Combined Variance for calculating stderr of task groupings by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1390
* adding hf_transfer by michaelfeil in https://github.com/EleutherAI/lm-evaluation-harness/pull/1400
* `batch_size` with `auto` defaults to 1 if `No executable batch size found` is raised by pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1405
* Fix printing bug in 1390 by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1414
* Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1416 by pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1418
* Fix watchdog timeout by JeevanBhoot in https://github.com/EleutherAI/lm-evaluation-harness/pull/1404
* Evaluate by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1385
* Add multilingual ARC task by uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1419
* Add multilingual TruthfulQA task by uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1420
* [m_mmul] added multilingual evaluation from alexandrainst/m_mmlu by giux78 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1358
* Added seeds to `evaluator.simple_evaluate` signature by Am1n3e in https://github.com/EleutherAI/lm-evaluation-harness/pull/1412
* Fix: task weighting by subtask size ; update Pooled Stderr formula slightly by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1427
* Refactor utilities into a separate model utils file. by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1429
* Nit fix: Updated OpenBookQA Readme by adavidho in https://github.com/EleutherAI/lm-evaluation-harness/pull/1430
* improve hf_transfer activation by michaelfeil in https://github.com/EleutherAI/lm-evaluation-harness/pull/1438
* Correct typo in task name in ARC documentation by larekrow in https://github.com/EleutherAI/lm-evaluation-harness/pull/1443
* update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) by thnkinbtfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/1356
* Add a new task HaeRae-Bench by h-albert-lee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1445
* Group reqs by context by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1425
* Add a new task GPQA (the part without CoT) by uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1434
* Added KMMLU evaluation method and changed ReadMe by h-albert-lee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1447
* Add TemplateLM boilerplate LM class by anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1279
* Log which subtasks were called with which groups by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1456
* PR fixing the issue 1391 (wrong contexts in the mgsm task) by leocnj in https://github.com/EleutherAI/lm-evaluation-harness/pull/1440
* feat: Add Weights and Biases support by ayulockin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1339
* Fixed generation args issue affection OpenAI completion model by Am1n3e in https://github.com/EleutherAI/lm-evaluation-harness/pull/1458
* update parsing logic of mgsm following gsm8k (mgsm en 0 -> 50%) by thnkinbtfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/1462
* Adding documentation for Weights and Biases CLI interface by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1466
* Add environment and transformers version logging in results dump by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1464
* Apply code autoformatting with Ruff to tasks/*.py an *__init__.py by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1469
* Setting trust_remote_code to `True` for HuggingFace datasets compatibility by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1467
* add arabic mmlu by khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1402
* Add Gemma support (Add flag to control BOS token usage) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1465
* Revert "Setting trust_remote_code to `True` for HuggingFace datasets compatibility" by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1474
* Create a means for caching task registration and request building. Ad… by inf3rnus in https://github.com/EleutherAI/lm-evaluation-harness/pull/1372
* Cont metrics by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1475
* Refactor `evaluater.evaluate` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1441
* add multilingual mmlu eval by jordane95 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1484
* Update TruthfulQA val split name by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1488
* Fix AttributeError in huggingface.py When 'model_type' is Missing by richwardle in https://github.com/EleutherAI/lm-evaluation-harness/pull/1489
* Fix duplicated kwargs in some model init by lchu-ibm in https://github.com/EleutherAI/lm-evaluation-harness/pull/1495
* Add multilingual truthfulqa targets by jordane95 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1499
* Always include EOS token as stop sequence by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1480
* Improve data-parallel request partitioning for VLLM by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1477
* modify `WandbLogger` to accept arbitrary kwargs by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1491
* Vllm update DP+TP by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1508
* Setting trust_remote_code to True for HuggingFace datasets compatibility by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1487
* Cleaning up unused unit tests by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1516
* French Bench by ManuelFay in https://github.com/EleutherAI/lm-evaluation-harness/pull/1500
* Hotfix: fix TypeError in `--trust_remote_code` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1517
* Fix minor edge cases (951 1503) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1520
* Openllm benchmark by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1526
* Add a new task GPQA (the part CoT and generative) by uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1482
* Add EQ-Bench as per 1459 by pbevan1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1511
* Add WMDP Multiple-choice by justinphan3110 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1534
* Adding new task : KorMedMCQA by sean0042 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1530
* Update docs on LM.loglikelihood_rolling abstract method by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1532
* Minor KMMLU cleanup by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1502
* Cleanup and fixes (Task, Instance, and a little bit of *evaluate) by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1533
* Update installation commands in openai_completions.py and contributing document and, update wandb_args description by naem1023 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1536
* Add compatibility for vLLM's new Logprob object by Yard1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1549
* Fix incorrect `max_gen_toks` generation kwarg default in code2_text. by cosmo3769 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1551
* Support jinja templating for task descriptions by HishamYahya in https://github.com/EleutherAI/lm-evaluation-harness/pull/1553
* Fix incorrect `max_gen_toks` generation kwarg default in generative Bigbench by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1546
* Hardcode IFEval to 0-shot by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1506
* add Arabic EXAMS benchmark by khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1498
* AGIEval by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1359
* cli_evaluate calls simple_evaluate with the same verbosity. by Wongboo in https://github.com/EleutherAI/lm-evaluation-harness/pull/1563
* add manual tqdm disabling management by artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1569
* Fix README section on vllm integration by eitanturok in https://github.com/EleutherAI/lm-evaluation-harness/pull/1579
* Fix Jinja template for Advanced AI Risk by RylanSchaeffer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1587
* Proposed approach for testing CLI arg parsing by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1566
* Patch for Seq2Seq Model predictions by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1584
* Add start date in results.json by djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1592
* Cleanup for v0.4.2 release by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1573
* Fix eval_logger import for mmlu/_generate_configs.py by noufmitla in https://github.com/EleutherAI/lm-evaluation-harness/pull/1593

New Contributors
* PicoCreator made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1374
* michaelfeil made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1364
* mycoalchen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1398
* JeevanBhoot made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1404
* uanu2002 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1419
* giux78 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1358
* Am1n3e made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1412
* adavidho made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1430
* larekrow made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1443
* leocnj made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1440
* ayulockin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1339
* khalil-Hennara made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1402
* inf3rnus made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1372
* jordane95 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1484
* richwardle made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1489
* lchu-ibm made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1495
* pbevan1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1511
* justinphan3110 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1534
* sean0042 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1530
* naem1023 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1536
* Yard1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1549
* cosmo3769 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1551
* HishamYahya made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1553
* Wongboo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1563
* artemorloff made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1569
* eitanturok made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1579
* RylanSchaeffer made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1587
* noufmitla made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1593

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.1...v0.4.2

Page 1 of 2

Releases

Has known vulnerabilities

Lm-eval

Page 1 of 2

0.4.7

0.4.6

0.4.5

0.4.4

0.4.3

0.4.2

Page 1 of 2

Links

Releases