We're releasing a new version of LM Eval Harness on PyPI at long last. We intend to release new PyPI versions more frequently in the future.
## New Additions
The big new feature is the often-requested **Chat Templating**, contributed by KonradSzafer, clefourrier, and NathanHB, and also worked on by a number of other awesome contributors!
You can now run with a chat template applied via `--apply_chat_template`, and supply a system prompt of your choosing via `--system_instruction "my sysprompt here"`. The `--fewshot_as_multiturn` flag controls whether each few-shot example in context is presented as a new conversational turn or not.
This feature is **currently only supported for model types `hf` and `vllm`**, but we intend to gather feedback on improvements and to extend it to other relevant model types, such as API-based models.
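For example, a chat-templated run might look something like this (the model and task names are placeholders; substitute your own):

```bash
# Apply the tokenizer's chat template, add a custom system prompt,
# and present each few-shot example as its own conversational turn.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks gsm8k \
    --num_fewshot 5 \
    --apply_chat_template \
    --system_instruction "You are a helpful assistant." \
    --fewshot_as_multiturn \
    --batch_size auto
```

Note that `--fewshot_as_multiturn` only makes sense in combination with `--apply_chat_template`.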
There's a lot more to check out, including:
- Logging results to the Hugging Face Hub if desired using `--hf_hub_log_args` (see the sketch after this list), by KonradSzafer and team!
- NeMo model support by sergiopperez !
- Anthropic Chat API support by tryuman !
- DeepSparse and SparseML model types by mgoin !
- Handling of delta-weights in HF models, by KonradSzafer !
- LoRA support for VLLM, by bcicc !
- Fixes to PEFT modules which add new tokens to the embedding layers, by mapmeld !
- Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by djstrong !
- The use of custom `Sampler` subclasses in tasks, by LSinev !
- The ability to specify "hardcoded" few-shot examples more cleanly, by clefourrier !
- Support for Ascend NPUs (`--device npu`) by statelesshz, zhabuye, jiaqiw09 and others!
- Logging of `higher_is_better` in results tables for clearer understanding of eval metrics by zafstojano !
- Extra information logged about models, including details about tokenizers, chat templating, and more, by artemorloff, djstrong, and others!
- Miscellaneous bug fixes! And many more great contributions we weren't able to list here.
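To give a feel for how a couple of these fit together, here is a rough sketch combining Hub logging and the new NPU device option. The `--hf_hub_log_args` key names below are illustrative assumptions; consult the interface documentation for the exact keys your installed version supports.

```bash
# Rough sketch: evaluate on an Ascend NPU and push results/samples to the HF Hub.
# The hub_* key names are illustrative; see docs/interface.md for the supported set.
lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks lambada_openai \
    --device npu \
    --output_path outputs \
    --log_samples \
    --hf_hub_log_args "hub_results_org=my-org,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False"
```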
## New Tasks
We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks they contain can now be found at `lm_eval/tasks/README.md`**. We hope this makes it easier to locate the definitions of relevant tasks: start at that page, then consult the README.md within the appropriate `lm_eval/tasks` subfolder for further information on each task it contains (the command shown just below is another quick way to enumerate available tasks). Thank you to AnthonyDipofi, Harryalways317, nairbv, sepiatone, and others for working on this and giving feedback!
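If you just need the task names, the `--tasks list` option referenced in the interface documentation will print them:

```bash
# Print the names of all tasks currently registered with the harness
lm_eval --tasks list
```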
Without further ado, the tasks:
- ACLUE, a benchmark for Ancient Chinese understanding, by haonan-li
- BasqueGlue and EusExams, two Basque-language tasks by juletx
- TMMLU+, an evaluation for Traditional Chinese, contributed by ZoneTwelve
- XNLIeu, a Basque version of XNLI, by juletx
- Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by mukobi
- FDA, SWDE, and Squad-Completion zero-shot tasks by simran-arora and team
- The `hendrycks_math` task, added back: the MATH task using the prompt and answer parsing from the original Hendrycks et al. MATH paper, rather than Minerva's prompt and parsing
- COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by Erland366
- tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by LucWeber and team!
- Glianorex, a benchmark for testing performance on fictional medical questions, by maximegmd
- New FLD (formal logic) task variants by MorishT
- Improved translations of Lambada Multilingual tasks, added by zafstojano
- NoticIA, a Spanish summarization dataset by ikergarcia1996
- The Paloma perplexity benchmark, added by zafstojano
- We've removed the AMMLU dataset due to concerns about auto-translation quality.
- Added the *localized*, not translated, ArabicMMLU dataset, contributed by Yazeed7 !
- BertaQA, a Basque cultural knowledge benchmark, by juletx
- New machine-translated ARC-C datasets by jonabur !
- CommonsenseQA, in a prompt format following Llama, by murphybrendan
- ...
## Backwards Incompatibilities
The save format for logged results has changed. Output files will now be written to:
- `{output_path}/{sanitized_model_name}/results_YYYY-MM-DDTHH-MM-SS.xxxxx.json` if `--output_path` is set, and
- `{output_path}/{sanitized_model_name}/samples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl` for each task's samples if `--log_samples` is set.
For example: `outputs/gpt2/results_2024-06-28T00-00-00.00001.json` and `outputs/gpt2/samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl`.
See https://github.com/EleutherAI/lm-evaluation-harness/pull/1926 for utilities that may help when working with these new filenames.
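In the meantime, a plain shell glob is usually enough to grab the newest results file for a given model directory (assuming the default layout shown above):

```bash
# Pick out the most recent results JSON for a model, relying on the
# timestamped filenames and sorting by modification time.
ls -t outputs/gpt2/results_*.json | head -n 1
```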
## Future Plans
In general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!
- The official **Open LLM Leaderboard 2** tasks will be landing soon in the Eval Harness main branch and subsequently in `v0.4.4` on PyPI!
- The fact that `group`s of tasks attempt, by default, to report an aggregated score across their constituent subtasks has been a sharp edge. We are finishing up some internal reworking to distinguish between `group`s, which *do* report aggregate scores (think `mmlu`), and `tag`s, which are simply a convenient shortcut for calling a set of tasks one might want to run at once (think the `pythia` grouping, which merely collects tasks one might want to run together but for which averaging doesn't make sense).
- We'd also like to improve the API model support in the Eval Harness from its current state.
- More to come!
Thank you to everyone who's contributed to or used the library!
Thanks, haileyschoelkopf and lintangsutawika
## What's Changed
* use BOS token in loglikelihood by djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1588
* Revert "Patch for Seq2Seq Model predictions" by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1601
* fix gen_kwargs arg reading by artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1607
* fix until arg processing by artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1608
* Fixes to Loglikelihood prefix token / VLLM by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1611
* Add ACLUE task by haonan-li in https://github.com/EleutherAI/lm-evaluation-harness/pull/1614
* OpenAI Completions -- fix passing of unexpected 'until' arg by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1612
* add logging of model args by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1619
* Add vLLM FAQs to README (1625) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1633
* peft Version Assertion by LameloBally in https://github.com/EleutherAI/lm-evaluation-harness/pull/1635
* Seq2seq fix by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1604
* Integration of NeMo models into LM Evaluation Harness library by sergiopperez in https://github.com/EleutherAI/lm-evaluation-harness/pull/1598
* Fix conditional import for Nemo LM class by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1641
* Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring by orsharir in https://github.com/EleutherAI/lm-evaluation-harness/pull/1647
* Add Latxa paper evaluation tasks for Basque by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1654
* Fix CLI --batch_size arg for openai-completions/local-completions by mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1656
* Patch QQP prompt (1648 ) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1661
* TMMLU+ implementation by ZoneTwelve in https://github.com/EleutherAI/lm-evaluation-harness/pull/1394
* Anthropic Chat API by tryumanshow in https://github.com/EleutherAI/lm-evaluation-harness/pull/1594
* correction bug EleutherAI1664 by nicho2 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1670
* Signpost potential bugs / unsupported ops in MPS backend by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1680
* Add delta weights model loading by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1712
* Add `neuralmagic` models for `sparseml` and `deepsparse` by mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1674
* Improvements to run NVIDIA NeMo models on LM Evaluation Harness by sergiopperez in https://github.com/EleutherAI/lm-evaluation-harness/pull/1699
* Adding retries and rate limit to toxicity tasks by sator-labs in https://github.com/EleutherAI/lm-evaluation-harness/pull/1620
* reference `--tasks list` in README by nairbv in https://github.com/EleutherAI/lm-evaluation-harness/pull/1726
* Add XNLIeu: a dataset for cross-lingual NLI in Basque by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1694
* Fix Parameter Propagation for Tasks that have `include` by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1749
* Support individual scrolls datasets by giorgossideris in https://github.com/EleutherAI/lm-evaluation-harness/pull/1740
* Add filter registry decorator by lozhn in https://github.com/EleutherAI/lm-evaluation-harness/pull/1750
* remove duplicated `num_fewshot: 0` by chujiezheng in https://github.com/EleutherAI/lm-evaluation-harness/pull/1769
* Pile 10k new task by mukobi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1758
* Fix m_arc choices by jordane95 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1760
* upload new tasks by simran-arora in https://github.com/EleutherAI/lm-evaluation-harness/pull/1728
* vllm lora support by bcicc in https://github.com/EleutherAI/lm-evaluation-harness/pull/1756
* Add option to set OpenVINO config by helena-intel in https://github.com/EleutherAI/lm-evaluation-harness/pull/1730
* evaluation tracker implementation by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1766
* eval tracker args fix by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1777
* limit fix by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1785
* remove echo parameter in OpenAI completions API by djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1779
* Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args` by MuhammadBinUsman03 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1776
* Fix bug in setting until kwarg in openai completions by ciaranby in https://github.com/EleutherAI/lm-evaluation-harness/pull/1784
* Provide ability for custom sampler for ConfigurableTask by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1616
* Update `--tasks list` option in interface documentation by sepiatone in https://github.com/EleutherAI/lm-evaluation-harness/pull/1792
* Fix Caching Tests ; Remove `pretrained=gpt2` default by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1775
* link to the example output on the hub by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1798
* Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1793
* Logging Updates (Alphabetize table printouts, fix eval tracker bug) (1774) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1791
* Initial integration of the Unitxt to LM eval harness by yoavkatz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1615
* add task for mmlu evaluation in arc multiple choice format by jonabur in https://github.com/EleutherAI/lm-evaluation-harness/pull/1745
* Update flag `--hf_hub_log_args` in interface documentation by sepiatone in https://github.com/EleutherAI/lm-evaluation-harness/pull/1806
* Copal task by Erland366 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1803
* Adding tinyBenchmarks datasets by LucWeber in https://github.com/EleutherAI/lm-evaluation-harness/pull/1545
* interface doc update by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1807
* Fix links in README guiding to another branch by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1838
* Fix: support PEFT/LoRA with added tokens by mapmeld in https://github.com/EleutherAI/lm-evaluation-harness/pull/1828
* Fix incorrect check for task type by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1865
* Fixing typos in `docs` by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1863
* Update polemo2_out.yaml by zhabuye in https://github.com/EleutherAI/lm-evaluation-harness/pull/1871
* Unpin vllm in dependencies by edgan8 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1874
* Fix outdated links to the latest links in `docs` by oneonlee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1876
* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code by statelesshz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1880
* Fix `batch_size=auto` for HF Seq2Seq models (1765) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1790
* Fix Brier Score by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1847
* Fix for bootstrap_iters = 0 case (1715) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1789
* add mmlu tasks from pile-t5 by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1710
* Bigbench fix by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1686
* Rename `lm_eval.logging -> lm_eval.loggers` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1858
* Updated vllm imports in vllm_causallms.py by mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1890
* [HFLM]Add support for Ascend NPU by statelesshz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1886
* `higher_is_better` tickers in output table by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1893
* Add dataset card when pushing to HF hub by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1898
* Making hardcoded few shots compatible with the chat template mechanism by clefourrier in https://github.com/EleutherAI/lm-evaluation-harness/pull/1895
* Try to make existing tests run little bit faster by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1905
* Fix fewshot seed only set when overriding num_fewshot by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1914
* Complete task list from pr 1727 by anthony-dipofi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1901
* Add chat template by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1873
* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data by maximegmd in https://github.com/EleutherAI/lm-evaluation-harness/pull/1867
* Modify pre-commit hook to check merge conflicts accidentally committed by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1927
* [add] fld logical formula task by MorishT in https://github.com/EleutherAI/lm-evaluation-harness/pull/1931
* Add new Lambada translations by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1897
* Implement NoticIA by ikergarcia1996 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1912
* Add The Arabic version of the PICA benchmark by khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1917
* Fix social_iqa answer choices by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1909
* Update basque-glue by zhabuye in https://github.com/EleutherAI/lm-evaluation-harness/pull/1913
* Test output table layout consistency by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1916
* Fix a tiny typo in `__main__.py` by sadra-barikbin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1939
* Add the Arabic version with refactor to Arabic pica to be in alghafa … by khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1940
* Results filenames handling fix by KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1926
* Remove AMMLU Due to Translation by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1948
* Add option in TaskManager to not index library default tasks ; Tests for include_path by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1856
* Force BOS token usage in 'gemma' models for VLLM by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1857
* Fix a tiny typo in `docs/interface.md` by sadra-barikbin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1955
* Fix self.max_tokens in anthropic_llms.py by lozhn in https://github.com/EleutherAI/lm-evaluation-harness/pull/1848
* `samples` is newline delimited by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1930
* Fix `--gen_kwargs` and VLLM (`temperature` not respected) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1800
* Make `scripts.write_out` error out when no splits match by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1796
* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' by johnwee1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1956
* add trust_remote_code for piqa by changwangss in https://github.com/EleutherAI/lm-evaluation-harness/pull/1983
* Fix self assignment in neuron_optimum.py by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1990
* [New Task] Add Paloma benchmark by zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1928
* Fix Paloma Template yaml by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1993
* Log `fewshot_as_multiturn` in results files by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1995
* Added ArabicMMLU by Yazeed7 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1987
* Fix Datasets `--trust_remote_code` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1998
* Add BertaQA dataset tasks by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1964
* add tokenizer logs info by artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1731
* Hotfix breaking import by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2015
* add arc_challenge_mt by jonabur in https://github.com/EleutherAI/lm-evaluation-harness/pull/1900
* Remove `LM` dependency from `build_all_requests` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2011
* Added CommonsenseQA task by murphybrendan in https://github.com/EleutherAI/lm-evaluation-harness/pull/1721
* Factor out LM-specific tests by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1859
* Update interface.md by johnwee1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1982
* Fix `trust_remote_code`-related test failures by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2024
* Fixes scrolls task bug with few_shot examples by xksteven in https://github.com/EleutherAI/lm-evaluation-harness/pull/2003
* fix cache by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2037
* Add chat template to `vllm` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2034
* Fail gracefully upon tokenizer logging failure (2035) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2038
* Bundle `exact_match` HF Evaluate metric with install, don't call evaluate.load() on import by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2045
* Update package version to v0.4.3 by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2046
## New Contributors
* LameloBally made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1635
* sergiopperez made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1598
* orsharir made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1647
* ZoneTwelve made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1394
* tryumanshow made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1594
* nicho2 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1670
* KonradSzafer made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1712
* sator-labs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1620
* giorgossideris made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1740
* lozhn made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1750
* chujiezheng made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1769
* mukobi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1758
* simran-arora made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1728
* bcicc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1756
* helena-intel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1730
* MuhammadBinUsman03 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1776
* ciaranby made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1784
* sepiatone made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1792
* yoavkatz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1615
* Erland366 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1803
* LucWeber made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1545
* mapmeld made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1828
* zafstojano made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1865
* zhabuye made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1871
* edgan8 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1874
* oneonlee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1876
* statelesshz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1880
* clefourrier made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1895
* maximegmd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1867
* ikergarcia1996 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1912
* sadra-barikbin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1939
* johnwee1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1956
* changwangss made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1983
* Yazeed7 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1987
* murphybrendan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1721
* xksteven made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2003
**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.2...v0.4.3