lm-eval

Latest version: v0.4.2

0.4.2

We are releasing a new minor version of lm-eval for PyPI users! We've been very happy to see continued usage of the lm-evaluation-harness, including as a standard testbench to propel new architecture design (https://arxiv.org/abs/2402.18668), to ease new benchmark creation (https://arxiv.org/abs/2402.11548, https://arxiv.org/abs/2402.00786, https://arxiv.org/abs/2403.01469), to enable controlled experimentation on LLM evaluation (https://arxiv.org/abs/2402.01781), and more!

New Additions
- Request Caching by inf3rnus - speedups on startup via caching the construction of documents/requests' contexts
- Weights and Biases logging by ayulockin - evals can now be logged to both WandB and Zeno!
- New Tasks
  - KMMLU, a localized - not (auto) translated! - dataset for testing Korean knowledge, by h-albert-lee and guijinSON
  - GPQA by uanu2002
  - French Bench by ManuelFay
  - EQ-Bench by pbevan1 and sqrkl
  - HAERAE-Bench, re-added by h-albert-lee
  - Updates to answer parsing on many generative tasks (GSM8k, MGSM, BBH zeroshot) by thnkinbtfly!
  - Okapi (translated) Open LLM Leaderboard tasks by uanu2002 and giux78
  - Arabic MMLU and aEXAMS by khalil-Hennara
  - And more!
- Re-introduction of the `TemplateLM` base class for lower-code new LM class implementations by anjor
- Run the library with the metrics/scoring stage skipped via `--predict_only` by baberabb (see the sketch after this list)
- Many more miscellaneous improvements by a lot of great contributors!
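
As a small illustration of the new scoring-skip mode, here is a minimal sketch that assumes `simple_evaluate` exposes a `predict_only` keyword mirroring the `--predict_only` CLI flag (the model choice is just an example):

```python
import lm_eval
from lm_eval.models.huggingface import HFLM

# Load a small Hugging Face model through the harness's HF wrapper.
lm = HFLM(pretrained="EleutherAI/pythia-160m")

# Collect model outputs only; the metrics/scoring stage is skipped.
results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["arc_easy"],
    predict_only=True,  # assumed keyword mirroring the --predict_only CLI flag
)
```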

Backwards Incompatibilities

There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:
`TaskManager` API

Previously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs.

Old usage:

```python
import lm_eval

lm_eval.tasks.initialize_tasks()
# or:
lm_eval.tasks.include_path("/path/to/my/custom/tasks")

# `lm` is an already-instantiated lm_eval model object
lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"])
```


New intended usage:

```python
import lm_eval

# Optional: only need to instantiate separately if you want to pass a custom path!
task_manager = lm_eval.tasks.TaskManager()  # pass include_path="/path/to/my/custom/tasks" if desired

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=task_manager)
```

`get_task_dict()` now also optionally takes a `TaskManager` object, for use when loading custom tasks.
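
For example, a minimal sketch, assuming `get_task_dict()` accepts the `TaskManager` as its second argument and with `my_custom_task` as a hypothetical task name:

```python
import lm_eval

task_manager = lm_eval.tasks.TaskManager(include_path="/path/to/my/custom/tasks")
# Build the task dict for built-in and custom tasks alike.
task_dict = lm_eval.tasks.get_task_dict(["arc_easy", "my_custom_task"], task_manager)
```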

This should allow for much faster library startup times due to lazily loading requested tasks or groups.

Updated Stderr Aggregation

Previous versions of the library erroneously reported overly large `stderr` scores for groups of tasks such as MMLU.

We've since updated the formula to correctly aggregate standard error scores for groups of tasks whose accuracies are aggregated via their mean across the dataset; see #1390 and #1427 for more information.
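
As an illustration of the pooled-variance approach, here is a minimal sketch of the idea (not the library's exact implementation): convert each subtask's standard error of the mean back to a sample variance, pool the variances with size weighting, and derive the standard error of the size-weighted group mean.

```python
import math

def pooled_group_stderr(stderrs, sizes):
    """Sketch: pool per-subtask stderrs of the mean into a group-level stderr."""
    # se**2 * n recovers each subtask's sample variance from its stderr of the mean.
    pooled_var = sum(
        (n - 1) * (se ** 2 * n) for se, n in zip(stderrs, sizes)
    ) / (sum(sizes) - len(sizes))
    # Standard error of the size-weighted mean over all subtask documents.
    return math.sqrt(pooled_var / sum(sizes))

# Example: three subtasks with 100, 250, and 400 documents.
print(pooled_group_stderr([0.05, 0.03, 0.02], [100, 250, 400]))
```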



As always, please feel free to give us feedback or request new features! We're grateful for the community's support.


What's Changed
* Add support for RWKV models with World tokenizer by PicoCreator in https://github.com/EleutherAI/lm-evaluation-harness/pull/1374
* add bypass metric by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1156
* Expand docs, update CITATION.bib by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1227
* Hf: minor egde cases by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1380
* Enable override of printed `n-shot` in table by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1379
* Faster Task and Group Loading, Allow Recursive Groups by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1321
* Fix for https://github.com/EleutherAI/lm-evaluation-harness/issues/1383 by pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1384
* fix on --task list by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1387
* Support for Inf2 optimum class [WIP] by michaelfeil in https://github.com/EleutherAI/lm-evaluation-harness/pull/1364
* Update README.md by mycoalchen in https://github.com/EleutherAI/lm-evaluation-harness/pull/1398
* Fix confusing `write_out.py` instructions in README by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1371
* Use Pooled rather than Combined Variance for calculating stderr of task groupings by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1390
* adding hf_transfer by michaelfeil in https://github.com/EleutherAI/lm-evaluation-harness/pull/1400
* `batch_size` with `auto` defaults to 1 if `No executable batch size found` is raised by pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1405
* Fix printing bug in 1390 by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1414
* Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1416 by pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1418
* Fix watchdog timeout by JeevanBhoot in https://github.com/EleutherAI/lm-evaluation-harness/pull/1404
* Evaluate by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1385
* Add multilingual ARC task by uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1419
* Add multilingual TruthfulQA task by uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1420
* [m_mmul] added multilingual evaluation from alexandrainst/m_mmlu by giux78 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1358
* Added seeds to `evaluator.simple_evaluate` signature by Am1n3e in https://github.com/EleutherAI/lm-evaluation-harness/pull/1412
* Fix: task weighting by subtask size ; update Pooled Stderr formula slightly by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1427
* Refactor utilities into a separate model utils file. by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1429
* Nit fix: Updated OpenBookQA Readme by adavidho in https://github.com/EleutherAI/lm-evaluation-harness/pull/1430
* improve hf_transfer activation by michaelfeil in https://github.com/EleutherAI/lm-evaluation-harness/pull/1438
* Correct typo in task name in ARC documentation by larekrow in https://github.com/EleutherAI/lm-evaluation-harness/pull/1443
* update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) by thnkinbtfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/1356
* Add a new task HaeRae-Bench by h-albert-lee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1445
* Group reqs by context by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1425
* Add a new task GPQA (the part without CoT) by uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1434
* Added KMMLU evaluation method and changed ReadMe by h-albert-lee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1447
* Add TemplateLM boilerplate LM class by anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1279
* Log which subtasks were called with which groups by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1456
* PR fixing the issue 1391 (wrong contexts in the mgsm task) by leocnj in https://github.com/EleutherAI/lm-evaluation-harness/pull/1440
* feat: Add Weights and Biases support by ayulockin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1339
* Fixed generation args issue affection OpenAI completion model by Am1n3e in https://github.com/EleutherAI/lm-evaluation-harness/pull/1458
* update parsing logic of mgsm following gsm8k (mgsm en 0 -> 50%) by thnkinbtfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/1462
* Adding documentation for Weights and Biases CLI interface by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1466
* Add environment and transformers version logging in results dump by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1464
* Apply code autoformatting with Ruff to tasks/*.py an *__init__.py by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1469
* Setting trust_remote_code to `True` for HuggingFace datasets compatibility by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1467
* add arabic mmlu by khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1402
* Add Gemma support (Add flag to control BOS token usage) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1465
* Revert "Setting trust_remote_code to `True` for HuggingFace datasets compatibility" by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1474
* Create a means for caching task registration and request building. Ad… by inf3rnus in https://github.com/EleutherAI/lm-evaluation-harness/pull/1372
* Cont metrics by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1475
* Refactor `evaluater.evaluate` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1441
* add multilingual mmlu eval by jordane95 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1484
* Update TruthfulQA val split name by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1488
* Fix AttributeError in huggingface.py When 'model_type' is Missing by richwardle in https://github.com/EleutherAI/lm-evaluation-harness/pull/1489
* Fix duplicated kwargs in some model init by lchu-ibm in https://github.com/EleutherAI/lm-evaluation-harness/pull/1495
* Add multilingual truthfulqa targets by jordane95 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1499
* Always include EOS token as stop sequence by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1480
* Improve data-parallel request partitioning for VLLM by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1477
* modify `WandbLogger` to accept arbitrary kwargs by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1491
* Vllm update DP+TP by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1508
* Setting trust_remote_code to True for HuggingFace datasets compatibility by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1487
* Cleaning up unused unit tests by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1516
* French Bench by ManuelFay in https://github.com/EleutherAI/lm-evaluation-harness/pull/1500
* Hotfix: fix TypeError in `--trust_remote_code` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1517
* Fix minor edge cases (951 1503) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1520
* Openllm benchmark by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1526
* Add a new task GPQA (the part CoT and generative) by uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1482
* Add EQ-Bench as per 1459 by pbevan1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1511
* Add WMDP Multiple-choice by justinphan3110 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1534
* Adding new task : KorMedMCQA by sean0042 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1530
* Update docs on LM.loglikelihood_rolling abstract method by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1532
* Minor KMMLU cleanup by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1502
* Cleanup and fixes (Task, Instance, and a little bit of *evaluate) by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1533
* Update installation commands in openai_completions.py and contributing document and, update wandb_args description by naem1023 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1536
* Add compatibility for vLLM's new Logprob object by Yard1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1549
* Fix incorrect `max_gen_toks` generation kwarg default in code2_text. by cosmo3769 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1551
* Support jinja templating for task descriptions by HishamYahya in https://github.com/EleutherAI/lm-evaluation-harness/pull/1553
* Fix incorrect `max_gen_toks` generation kwarg default in generative Bigbench by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1546
* Hardcode IFEval to 0-shot by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1506
* add Arabic EXAMS benchmark by khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1498
* AGIEval by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1359
* cli_evaluate calls simple_evaluate with the same verbosity. by Wongboo in https://github.com/EleutherAI/lm-evaluation-harness/pull/1563
* add manual tqdm disabling management by artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1569
* Fix README section on vllm integration by eitanturok in https://github.com/EleutherAI/lm-evaluation-harness/pull/1579
* Fix Jinja template for Advanced AI Risk by RylanSchaeffer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1587
* Proposed approach for testing CLI arg parsing by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1566
* Patch for Seq2Seq Model predictions by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1584
* Add start date in results.json by djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1592
* Cleanup for v0.4.2 release by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1573
* Fix eval_logger import for mmlu/_generate_configs.py by noufmitla in https://github.com/EleutherAI/lm-evaluation-harness/pull/1593

New Contributors
* PicoCreator made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1374
* michaelfeil made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1364
* mycoalchen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1398
* JeevanBhoot made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1404
* uanu2002 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1419
* giux78 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1358
* Am1n3e made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1412
* adavidho made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1430
* larekrow made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1443
* leocnj made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1440
* ayulockin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1339
* khalil-Hennara made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1402
* inf3rnus made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1372
* jordane95 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1484
* richwardle made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1489
* lchu-ibm made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1495
* pbevan1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1511
* justinphan3110 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1534
* sean0042 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1530
* naem1023 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1536
* Yard1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1549
* cosmo3769 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1551
* HishamYahya made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1553
* Wongboo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1563
* artemorloff made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1569
* eitanturok made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1579
* RylanSchaeffer made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1587
* noufmitla made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1593

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.1...v0.4.2

0.4.1

Release Notes

This release contains all changes since the release of v0.4.0, and is partially a test of our release automation, provided by anjor.

At a high level, some of the changes include:

- Data-parallel inference using vLLM (contributed by baberabb)
- A major fix to Hugging Face model generation: previously, in v0.4.0, a bug in stop-sequence handling sometimes cut generations off too early.
- Miscellaneous documentation updates
- A number of new tasks, and bugfixes to old tasks!
- Support for OpenAI-like API models using `local-completions` or `local-chat-completions` (thanks to veekaybee, mgoin, anjor, and others; see the sketch after this list)!
- Integration with tools for visualizing results, such as Zeno, with WandB support coming soon!
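
For the OpenAI-compatible API support, a hedged sketch of pointing the harness at a locally hosted completions server via `simple_evaluate` (the served model name, URL, and exact `model_args` keys here are illustrative assumptions rather than guaranteed parameter names):

```python
import lm_eval

# Assumes an OpenAI-compatible completions server is already running locally;
# "my-served-model" and the base_url value are illustrative placeholders.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args="model=my-served-model,base_url=http://localhost:8000/v1/completions",
    tasks=["arc_easy"],
)
```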

More frequent (minor) version releases may be done in the future, to make it easier for PyPI users!

We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.

In the next version release, we hope to include:
- Chat Templating + System Prompt support for locally-run models
- Improved answer extraction for many generative tasks, making them easier to run zero-shot and less dependent on model output formatting
- General speedups and QoL fixes to the non-inference portions of LM-Evaluation-Harness, including drastically reduced startup times and faster non-inference processing steps, especially when `num_fewshot` is large!
- A new `TaskManager` object and the deprecation of `lm_eval.tasks.initialize_tasks()`, for easier registration of many tasks and configuration of new task groups

What's Changed
* Announce v0.4.0 in README by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1061
* remove commented planned samplers in `lm_eval/api/samplers.py` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1062
* Confirming links in docs work (WIP) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1065
* Set actual version to v0.4.0 by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1064
* Updating docs hyperlinks by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1066
* Fiddling with READMEs, Reenable CI tests on `main` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1063
* Update _cot_fewshot_template_yaml by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1074
* Patch scrolls by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1077
* Update template of qqp dataset by shiweijiezero in https://github.com/EleutherAI/lm-evaluation-harness/pull/1097
* Change the sub-task name from sst to sst2 in glue by shiweijiezero in https://github.com/EleutherAI/lm-evaluation-harness/pull/1099
* Add kmmlu evaluation to tasks by h-albert-lee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1089
* Fix stderr by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1106
* Simplified `evaluator.py` by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1104
* [Refactor] vllm data parallel by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1035
* Unpack group in `write_out` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1113
* Revert "Simplified `evaluator.py`" by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1116
* `qqp`, `mnli_mismatch`: remove unlabled test sets by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1114
* fix: bug of BBH_cot_fewshot by Momo-Tori in https://github.com/EleutherAI/lm-evaluation-harness/pull/1118
* Bump BBH version by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1120
* Refactor `hf` modeling code by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1096
* Additional process for doc_to_choice by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1093
* doc_to_decontamination_query can use function by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1082
* Fix vllm `batch_size` type by xTayEx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1128
* fix: passing max_length to vllm engine args by NanoCode012 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1124
* Fix Loading Local Dataset by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1127
* place model onto `mps` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1133
* Add benchmark FLD by MorishT in https://github.com/EleutherAI/lm-evaluation-harness/pull/1122
* fix typo in README.md by lennijusten in https://github.com/EleutherAI/lm-evaluation-harness/pull/1136
* add correct openai api key to README.md by lennijusten in https://github.com/EleutherAI/lm-evaluation-harness/pull/1138
* Update Linter CI Job by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1130
* add utils.clear_torch_cache() to model_comparator by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1142
* Enabling OpenAI completions via gooseai by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1141
* vllm clean up tqdm by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1144
* openai nits by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1139
* Add IFEval / Instruction-Following Eval by wiskojo in https://github.com/EleutherAI/lm-evaluation-harness/pull/1087
* set `--gen_kwargs` arg to None by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1145
* Add shorthand flags by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1149
* fld bugfix by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1150
* Remove GooseAI docs and change no-commit-to-branch precommit hook by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1154
* Add docs on adding a multiple choice metric by polm-stability in https://github.com/EleutherAI/lm-evaluation-harness/pull/1147
* Simplify evaluator by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1126
* Generalize Qwen tokenizer fix by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1146
* self.device in huggingface.py line 210 treated as torch.device but might be a string by pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1172
* Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation by seungduk-yanolja in https://github.com/EleutherAI/lm-evaluation-harness/pull/1171
* feat: add option to upload results to Zeno by Sparkier in https://github.com/EleutherAI/lm-evaluation-harness/pull/990
* Switch Linting to `ruff` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1166
* Error in --num_fewshot option for K-MMLU Evaluation Harness by guijinSON in https://github.com/EleutherAI/lm-evaluation-harness/pull/1178
* Implementing local OpenAI API-style chat completions on any given inference server by veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1174
* Update README.md by anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1184
* Update README.md by anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1183
* Add tokenizer backend by anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1186
* Correctly Print Task Versioning by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1173
* update Zeno example and reference in README by Sparkier in https://github.com/EleutherAI/lm-evaluation-harness/pull/1190
* Remove tokenizer for openai chat completions by anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1191
* Update README.md by anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1181
* disable `mypy` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1193
* Generic decorator for handling rate limit errors by zachschillaci27 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1109
* Refer in README to main branch by BramVanroy in https://github.com/EleutherAI/lm-evaluation-harness/pull/1200
* Hardcode 0-shot for fewshot Minerva Math tasks by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1189
* Upstream Mamba Support (`mamba_ssm`) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1110
* Update cuda handling by anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1180
* Fix documentation in API table by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1203
* Consolidate batching by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1197
* Add remove_whitespace to FLD benchmark by MorishT in https://github.com/EleutherAI/lm-evaluation-harness/pull/1206
* Fix the argument order in `utils.divide` doc by xTayEx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1208
* [Fix 1211 ] pin vllm at < 0.2.6 by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1212
* fix unbounded local variable by onnoo in https://github.com/EleutherAI/lm-evaluation-harness/pull/1218
* nits + fix siqa by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1216
* add length of strings and answer options to Zeno metadata by Sparkier in https://github.com/EleutherAI/lm-evaluation-harness/pull/1222
* Don't silence errors when loading tasks by polm-stability in https://github.com/EleutherAI/lm-evaluation-harness/pull/1148
* Update README.md by anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1195
* Update race's README.md by pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1230
* batch_schedular bug in Collator by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1229
* Update openai_completions.py by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1238
* vllm: handle max_length better and substitute Collator by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1241
* Remove self.dataset_path post_init process by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1243
* Add multilingual HellaSwag task by JorgeDeCorte in https://github.com/EleutherAI/lm-evaluation-harness/pull/1228
* Do not escape ascii in logging outputs by passaglia in https://github.com/EleutherAI/lm-evaluation-harness/pull/1246
* fixed fewshot loading for multiple input tasks by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1255
* Revert citation by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1257
* Specify utf-8 encoding to properly save non-ascii samples to file by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1265
* Fix evaluation for the belebele dataset by jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/1267
* Call "exact_match" once for each multiple-target sample by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1266
* MultiMedQA by tmabraham in https://github.com/EleutherAI/lm-evaluation-harness/pull/1198
* Fix bug in multi-token Stop Sequences by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1268
* Update Table Printing by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1271
* add Kobest by jp1924 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1263
* Apply `process_docs()` to fewshot_split by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1276
* Fix whitespace issues in GSM8k-CoT by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1275
* Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1261
* Allow parameter edits for registered tasks when listed in a benchmark by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1273
* Fix data-parallel evaluation with quantized models by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1270
* Rework documentation for explaining local dataset by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1284
* Update CITATION.bib by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1285
* Update `nq_open` / NaturalQs whitespacing by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1289
* Update README.md with custom integration doc by msaroufim in https://github.com/EleutherAI/lm-evaluation-harness/pull/1298
* Update nq_open.yaml by Hannibal046 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1305
* Update task_guide.md by daniellepintz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1306
* Pin `datasets` dependency at 2.15 by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1312
* Fix polemo2_in.yaml subset name by lhoestq in https://github.com/EleutherAI/lm-evaluation-harness/pull/1313
* Fix `datasets` dependency to >=2.14 by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1314
* Fix group register by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1315
* Update task_guide.md by djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1316
* Update polemo2_in.yaml by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1318
* Fix: Mamba receives extra kwargs by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1328
* Fix Issue regarding stderr by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1327
* Add `local-completions` support using OpenAI interface by mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1277
* fallback to classname when LM doesnt have config by nairbv in https://github.com/EleutherAI/lm-evaluation-harness/pull/1334
* fix a trailing whitespace that breaks a lint job by nairbv in https://github.com/EleutherAI/lm-evaluation-harness/pull/1335
* skip "benchmarks" in changed_tasks by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1336
* Update migrated HF dataset paths by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1332
* Don't use `get_task_dict()` in task registration / initialization by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1331
* manage default (greedy) gen_kwargs in vllm by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1341
* vllm: change default gen_kwargs behaviour; prompt_logprobs=1 by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1345
* Update links to advanced_task_guide.md by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1348
* `Filter` docs not offset by `doc_id` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1349
* Add FAQ on `lm_eval.tasks.initialize_tasks()` to README by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1330
* Refix issue regarding stderr by thnkinbtfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/1357
* Add causalLM OpenVino models by NoushNabi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1290
* Apply some best practices and guideline recommendations to code by LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1363
* serialize callable functions in config by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1367
* delay filter init; remove `*args` by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1369
* Fix unintuitive `--gen_kwargs` behavior by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1329
* Publish to pypi by anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1194
* Make dependencies compatible with PyPI by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1378

New Contributors
* shiweijiezero made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1097
* h-albert-lee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1089
* Momo-Tori made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1118
* xTayEx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1128
* NanoCode012 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1124
* MorishT made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1122
* lennijusten made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1136
* veekaybee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1141
* wiskojo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1087
* polm-stability made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1147
* seungduk-yanolja made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1171
* Sparkier made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/990
* anjor made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1184
* zachschillaci27 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1109
* BramVanroy made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1200
* onnoo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1218
* JorgeDeCorte made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1228
* jmichaelov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1267
* jp1924 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1263
* msaroufim made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1298
* Hannibal046 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1305
* daniellepintz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1306
* lhoestq made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1313
* djstrong made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1316
* nairbv made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1334
* thnkinbtfly made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1357
* NoushNabi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1290
* LSinev made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1363

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.0...v0.4.1

0.4.0

What's Changed
* Replace stale `triviaqa` dataset link by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/364
* Update `actions/setup-python`in CI workflows by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/365
* Bump `triviaqa` version by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/366
* Update `lambada_openai` multilingual data source by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/370
* Update Pile Test/Val Download URLs by fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/373
* Added ToxiGen task by Thartvigsen in https://github.com/EleutherAI/lm-evaluation-harness/pull/377
* Added CrowSPairs by aflah02 in https://github.com/EleutherAI/lm-evaluation-harness/pull/379
* Add accuracy metric to crows-pairs by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/380
* hotfix(gpt2): Remove vocab-size logits slice by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/384
* Enable "low_cpu_mem_usage" to reduce the memory usage of HF models by sxjscience in https://github.com/EleutherAI/lm-evaluation-harness/pull/390
* Upstream `hf-causal` and `hf-seq2seq` model implementations by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/381
* Hosting arithmetic dataset on HuggingFace by fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/391
* Hosting wikitext on HuggingFace by fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/396
* Change device parameter to cuda:0 to avoid runtime error by Jeffwan in https://github.com/EleutherAI/lm-evaluation-harness/pull/403
* Update README installation instructions by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/407
* feat: evaluation using peft models with CLM by zanussbaum in https://github.com/EleutherAI/lm-evaluation-harness/pull/414
* Update setup.py dependencies by ret2libc in https://github.com/EleutherAI/lm-evaluation-harness/pull/416
* fix: add seq2seq peft by zanussbaum in https://github.com/EleutherAI/lm-evaluation-harness/pull/418
* Add support for load_in_8bit and trust_remote_code model params by philwee in https://github.com/EleutherAI/lm-evaluation-harness/pull/422
* Hotfix: patch issues with the `huggingface.py` model classes by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/427
* Continuing work on refactor [WIP] by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/425
* Document task name wildcard support in README by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/435
* Add non-programmatic BIG-bench-hard tasks by yurodiviy in https://github.com/EleutherAI/lm-evaluation-harness/pull/406
* Updated handling for device in lm_eval/models/gpt2.py by nikhilpinnaparaju in https://github.com/EleutherAI/lm-evaluation-harness/pull/447
* [WIP, Refactor] Staging more changes by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/465
* [Refactor, WIP] Multiple Choice + loglikelihood_rolling support for YAML tasks by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/467
* Configurable-Tasks by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/438
* single GPU automatic batching logic by fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/394
* Fix bugs introduced in 394 406 and max length bug by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/472
* Sort task names to keep the same order always by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/474
* Set PAD token to EOS token by nikhilpinnaparaju in https://github.com/EleutherAI/lm-evaluation-harness/pull/448
* [Refactor] Add decorator for registering YAMLs as tasks by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/486
* fix adaptive batch crash when there are no new requests by jquesnelle in https://github.com/EleutherAI/lm-evaluation-harness/pull/490
* Add multilingual datasets (XCOPA, XStoryCloze, XWinograd, PAWS-X, XNLI, MGSM) by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/426
* Create output path directory if necessary by janEbert in https://github.com/EleutherAI/lm-evaluation-harness/pull/483
* Add results of various models in json and md format by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/477
* Update config by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/501
* P3 prompt task by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/493
* Evaluation Against Portion of Benchmark Data by kenhktsui in https://github.com/EleutherAI/lm-evaluation-harness/pull/480
* Add option to dump prompts and completions to a JSON file by juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/492
* Add perplexity task on arbitrary JSON data by janEbert in https://github.com/EleutherAI/lm-evaluation-harness/pull/481
* Update config by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/520
* Data Parallelism by fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/488
* Fix mgpt fewshot by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/522
* Extend `dtype` command line flag to `HFLM` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/523
* Add support for loading GPTQ models via AutoGPTQ by gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/519
* Change type signature of `quantized` and its default value for python < 3.11 compatibility by passaglia in https://github.com/EleutherAI/lm-evaluation-harness/pull/532
* Fix LLaMA tokenization issue by gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/531
* [Refactor] Make promptsource an extra / not required for installation by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/542
* Move spaces from context to continuation by gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/546
* Use max_length in AutoSeq2SeqLM by gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/551
* Fix typo by kwikiel in https://github.com/EleutherAI/lm-evaluation-harness/pull/557
* Add load_in_4bit and fix peft loading by gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/556
* Update task_guide.md by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/564
* [Refactor] Non-greedy generation ; WIP GSM8k yaml by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/559
* Dataset metric log [WIP] by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/560
* Add Anthropic support by zphang in https://github.com/EleutherAI/lm-evaluation-harness/pull/562
* Add MultipleChoiceExactTask by gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/537
* Revert "Add MultipleChoiceExactTask" by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/568
* [Refactor] [WIP] New YAML advanced docs by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/567
* Remove the registration of "GPT2" as a model type by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/574
* [Refactor] Docs update by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/577
* Better docs by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/576
* Update evaluator.py cache_db argument str if model is not str by poedator in https://github.com/EleutherAI/lm-evaluation-harness/pull/575
* Add --max_batch_size and --batch_size auto:N by gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/572
* [Refactor] ALL_TASKS now maintained (not static) by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/581
* Fix seqlen issues for bloom, remove extraneous OPT tokenizer check by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/582
* Fix non-callable attributes in CachingLM by gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/584
* Add error handling for calling `.to(device)` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/585
* fixes some minor issues on tasks. by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/580
* Add - 4bit-related args by SONG-WONHO in https://github.com/EleutherAI/lm-evaluation-harness/pull/579
* Fix triviaqa task by seopbo in https://github.com/EleutherAI/lm-evaluation-harness/pull/525
* [Refactor] Addressing Feedback on new docs pages by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/578
* Logging Samples by farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/563
* Merge master into big-refactor by gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/590
* [Refactor] Package YAMLs alongside pip installations of lm-eval by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/596
* fixes for multiple_choice by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/598
* add openbookqa config by farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/600
* [Refactor] Model guide docs by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/606
* [Refactor] More MCQA fixes by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/599
* [Refactor] Hellaswag by nopperl in https://github.com/EleutherAI/lm-evaluation-harness/pull/608
* [Refactor] Seq2Seq Models with Multi-Device Support by fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/565
* [Refactor] CachingLM support via `--use_cache` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/619
* [Refactor] batch generation better for `hf` model ; deprecate `hf-causal` in new release by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/613
* [Refactor] Update task statuses on tracking list by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/629
* [Refactor] `device_map` options for `hf` model type by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/625
* [Refactor] Misc. cleanup of dead code by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/609
* [Refactor] Log request arguments to per-sample json by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/624
* [Refactor] HellaSwag YAML fix by nopperl in https://github.com/EleutherAI/lm-evaluation-harness/pull/639
* [Refactor] Add caveats to `parallelize=True` docs by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/638
* fixed super_glue and removed unused yaml config by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/645
* [Refactor] Fix sample logging by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/646
* Add PEFT, quantization, remote code, LLaMA fix by gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/644
* [Refactor] Handle `cuda:0` device assignment by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/647
* [refactor] Add prost config by farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/640
* [Refactor] Misc. bugfixes ; edgecase quantized models by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/648
* Update __init__.py by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/650
* [Refactor] Add Lambada Multilingual by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/658
* [Refactor] Add: SWAG,RACE,Arithmetic,Winogrande,PubmedQA by fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/627
* [refactor] Add qa4mre config by farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/651
* Update `generation_kwargs` by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/657
* [Refactor] Move race dataset on HF to EleutherAI group by fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/661
* [Refactor] Add Headqa by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/659
* [Refactor] Add Unscramble ; Toxigen ; Hendrycks_Ethics ; MathQA by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/660
* [Refactor] Port TruthfulQA (mc1 only) by nopperl in https://github.com/EleutherAI/lm-evaluation-harness/pull/666
* [Refactor] Miscellaneous fixes by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/676
* [Refactor] Patch to revamp-process by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/678
* Revamp process by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/671
* [Refactor] Fix padding ranks by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/679
* [Refactor] minor edits by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/680
* [Refactor] Migrate ANLI tasks to yaml by yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/682
* edited output_path and added help to args by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/684
* [Refactor] Minor changes by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/685
* [Refactor] typo by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/687
* [Test] fix test_evaluator.py by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/675
* Fix dummy model not invoking super class constructor by yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/688
* [Refactor] Migrate webqs task to yaml by yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/689
* [Refactor] Fix tests by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/693
* [Refactor] Migrate xwinograd tasks to yaml by yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/695
* Early stop bug of greedy_until (primary_until should be a list of str) by ZZR0 in https://github.com/EleutherAI/lm-evaluation-harness/pull/700
* Remove condition to check for `winograd_schema` by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/690
* [Refactor] Use console script by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/703
* [Refactor] Fixes for when using `num_fewshot` by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/702
* [Refactor] Updated anthropic to new API by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/710
* [Refactor] Cleanup for `big-refactor` by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/686
* Update README.md by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/720
* [Refactor] Benchmark scripts by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/612
* [Refactor] Fix Max Length arg by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/723
* Add note about MPS by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/728
* Update huggingface.py by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/730
* Update README.md by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/732
* [Refactor] Port over Autobatching by fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/673
* [Refactor] Fix Anthropic Import and other fixes by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/724
* [Refactor] Remove Unused Variable in Make-Table by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/734
* [Refactor] logiqav2 by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/711
* [Refactor] Fix task packaging by yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/739
* [Refactor] fixed openai by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/736
* [Refactor] added some typehints by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/742
* [Refactor] Port Babi task by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/752
* [Refactor] CrowS-Pairs by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/751
* Update README.md by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/745
* [Refactor] add xcopa by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/749
* Update README.md by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/764
* [Refactor] Add Blimp by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/763
* [Refactor] Use evaluation mode for accelerate to prevent OOM by tju01 in https://github.com/EleutherAI/lm-evaluation-harness/pull/770
* Patch Blimp by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/768
* [Refactor] Speedup hellaswag context building by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/774
* [Refactor] Patch crowspairs higher_is_better by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/766
* [Refactor] XNLI by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/776
* [Refactor] Update Benchmark by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/777
* [WIP] Update API docs in README by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/747
* [Refactor] Real Toxicity Prompts by aflah02 in https://github.com/EleutherAI/lm-evaluation-harness/pull/725
* [Refactor] XStoryCloze by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/759
* [Refactor] Glue by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/761
* [Refactor] Add triviaqa by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/758
* [Refactor] Paws-X by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/779
* [Refactor] MC Taco by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/783
* [Refactor] Truthfulqa by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/782
* [Refactor] fix doc_to_target processing by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/786
* [Refactor] Add README.md by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/757
* [Refactor] Don't always require Perspective API key to run by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/788
* [Refactor] Added HF model test by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/791
* [Big refactor] HF test fixup by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/793
* [Refactor] Process Whitespace for greedy_until by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/781
* [Refactor] Fix metrics in Greedy Until by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/780
* Update README.md by Wehzie in https://github.com/EleutherAI/lm-evaluation-harness/pull/803
* Merge Fix metrics branch by uSaiPrashanth in https://github.com/EleutherAI/lm-evaluation-harness/pull/802
* [Refactor] Update docs by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/744
* [Refactor] Superglue T5 Parity by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/769
* Update main.py by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/817
* [Refactor] Coqa by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/820
* [Refactor] drop by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/821
* [Refactor] Asdiv by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/813
* [Refactor] Fix IndexError by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/819
* [Refactor] toxicity: API inside function by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/822
* [Refactor] wsc273 by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/807
* [Refactor] Bump min accelerate version and update documentation by fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/812
* Add mypy baseline config by ethanhs in https://github.com/EleutherAI/lm-evaluation-harness/pull/809
* [Refactor] Fix wikitext task by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/833
* [Refactor] Add WMT tasks by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/775
* [Refactor] consolidated tasks tests by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/831
* Update README.md by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/838
* [Refactor] mgsm by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/784
* [Refactor] Add top-level import by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/830
* Add pyproject.toml by ethanhs in https://github.com/EleutherAI/lm-evaluation-harness/pull/810
* [Refactor] Additions to docs by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/799
* [Refactor] Fix MGSM by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/845
* [Refactor] float16 MPS works in torch nightly by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/853
* [Refactor] Update benchmark by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/850
* Switch to pyproject.toml based project metadata by ethanhs in https://github.com/EleutherAI/lm-evaluation-harness/pull/854
* Use Dict to make the code python 3.8 compatible by chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/857
* [Refactor] NQopen by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/859
* [Refactor] NQ-open by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/798
* Fix "local variable 'docs' referenced before assignment" error in write_out.py by chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/856
* [Refactor] 3.8 test compatibility by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/863
* [Refactor] Cleanup dependencies by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/860
* [Refactor] Qasper, MuTual, MGSM (Native CoT) by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/840
* undefined type and output_type when using promptsource fixed by Hojjat-Mokhtarabadi in https://github.com/EleutherAI/lm-evaluation-harness/pull/842
* [Refactor] Deactivate select GH Actions by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/871
* [Refactor] squadv2 by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/785
* [Refactor] Set python3.8 as allowed version by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/862
* Fix positional arguments in HF model generate by chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/877
* [Refactor] MATH by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/861
* Create cot_yaml by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/870
* [Refactor] Port CSATQA to refactor by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/865
* [Refactor] CMMLU, C-Eval port ; Add fewshot config by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/864
* [Refactor] README.md for Asdiv by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/878
* [Refactor] Hotfixes to big-refactor by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/880
* Change Python Version to 3.8 in .pre-commit-config.yaml and GitHub Actions by chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/895
* [Refactor] Fix PubMedQA by tmabraham in https://github.com/EleutherAI/lm-evaluation-harness/pull/890
* [Refactor] Fix error when calling `lm-eval` by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/899
* [Refactor] bigbench by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/852
* [Refactor] Fix wildcards by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/900
* Add transformation filters by chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/883
* [Refactor] Flan benchmark by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/816
* [Refactor] WIP: Add MMLU by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/753
* Added notable contributors to the citation block by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/907
* [Refactor] Improve error logging by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/908
* [Refactor] Add _batch_scheduler in greedy_until by AndyWolfZwei in https://github.com/EleutherAI/lm-evaluation-harness/pull/912
* add belebele by ManuelFay in https://github.com/EleutherAI/lm-evaluation-harness/pull/885
* Update README.md by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/917
* [Refactor] Precommit formatting for Belebele by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/926
* [Refactor] change all mentions of `greedy_until` to `generate_until` by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/927
* [Refactor] Squadv2 updates by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/923
* [Refactor] Verbose by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/910
* [Refactor] Fix Unit Tests by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/905
* Fix `generate_until` rename by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/929
* [Refactor] Generate_until rename by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/931
* Fix "'tqdm' object is not subscriptable" error in huggingface.py when batch size is auto by jasonkrone in https://github.com/EleutherAI/lm-evaluation-harness/pull/916
* [Refactor] Fix Default Metric Call by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/935
* Big refactor write out adaption by MicPie in https://github.com/EleutherAI/lm-evaluation-harness/pull/937
* Update pyproject.toml by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/915
* [Refactor] Fix whitespace warning by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/949
* [Refactor] Update documentation by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/954
* [Refactor] Fix two bugs when run with qasper_bool and toxigen by AndyWolfZwei in https://github.com/EleutherAI/lm-evaluation-harness/pull/934
* [Refactor] Describe local dataset usage in docs by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/956
* [Refactor] Update README, documentation by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/955
* [Refactor] Don't load MMLU auxiliary_train set by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/953
* [Refactor] Patch for Generation Until by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/957
* [Refactor] Model written eval by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/815
* [Refactor] Bugfix: AttributeError: 'Namespace' object has no attribute 'verbose' by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/966
* [Refactor] Mmlu subgroups and weight avg by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/922
* [Refactor] Remove deprecated `gold_alias` task YAML option by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/965
* [Refactor] Logging fixes by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/952
* [Refactor] fixes for alternative MMLU tasks. by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/981
* [Refactor] Alias fix by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/987
* [Refactor] Minor cleanup on base `Task` subclasses by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/996
* [Refactor] add squad from master by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/971
* [Refactor] Squad misc by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/999
* [Refactor] Fix CI tests by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/997
* [Refactor] will check if group_name is None by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1001
* [Refactor] Bugfixes by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1002
* [Refactor] Verbosity rework by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/958
* add description on task/group alias by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/979
* [Refactor] Upstream ggml from big-refactor branch by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/967
* [Refactor] Improve Handling of Stop-Sequences for HF Batched Generation by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1009
* [Refactor] Update README by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1020
* [Refactor] Remove `examples/` folder by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1018
* [Refactor] vllm support by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1011
* Allow Generation arguments on greedy_until reqs by uSaiPrashanth in https://github.com/EleutherAI/lm-evaluation-harness/pull/897
* Social iqa by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1030
* [Refactor] BBH fixup by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1029
* Rename bigbench.yml to default.yml by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1032
* [Refactor] Num_fewshot process by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/985
* [Refactor] Use correct HF model type for MBart-like models by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1024
* [Refactor] Urgent fix by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1033
* [Refactor] Versioning by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1031
* fixes for sampler by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1038
* [Refactor] Update README.md by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1046
* [refactor] mps requirement by baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1037
* [Refactor] Additions to example notebook by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1048
* Miscellaneous documentation updates by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1047
* [Refactor] add notebook for overview by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1025
* Update README.md by StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1049
* [Refactor] Openai completions by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1008
* [Refactor] Added support for OpenAI ChatCompletions by DaveOkpare in https://github.com/EleutherAI/lm-evaluation-harness/pull/839
* [Refactor] Update docs ToC by haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1051
* [Refactor] Fix fewshot cot mmlu descriptions by lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1060

New Contributors
* fattorib made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/373
* Thartvigsen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/377
* aflah02 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/379
* sxjscience made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/390
* Jeffwan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/403
* zanussbaum made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/414
* ret2libc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/416
* philwee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/422
* yurodiviy made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/406
* nikhilpinnaparaju made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/447
* lintangsutawika made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/438
* juletx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/472
* janEbert made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/483
* kenhktsui made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/480
* passaglia made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/532
* kwikiel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/557
* poedator made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/575
* SONG-WONHO made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/579
* seopbo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/525
* farzanehnakhaee70 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/563
* nopperl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/608
* yeoedward made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/682
* ZZR0 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/700
* tju01 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/770
* Wehzie made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/803
* uSaiPrashanth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/802
* ethanhs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/809
* chrisociepa made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/857
* Hojjat-Mokhtarabadi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/842
* AndyWolfZwei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/912
* ManuelFay made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/885
* jasonkrone made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/916
* MicPie made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/937
* DaveOkpare made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/839

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.3.0...v0.4.0

0.3.0

HuggingFace Datasets Integration
This release integrates HuggingFace `datasets` as the core dataset management interface, removing previous custom downloaders.
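
In practice, this means a task points at a dataset on the HuggingFace Hub instead of shipping its own download code. A minimal sketch of the idea (class and attribute names here are illustrative, not the exact 0.3.0 `Task` API):

import datasets

class ExampleTask:
    DATASET_PATH = "super_glue"   # dataset identifier on the HuggingFace Hub (illustrative choice)
    DATASET_NAME = "boolq"        # optional sub-configuration

    def download(self):
        # `datasets` handles downloading, caching, and versioning; no custom downloader needed.
        self.dataset = datasets.load_dataset(self.DATASET_PATH, self.DATASET_NAME)

    def validation_docs(self):
        return self.dataset["validation"]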

What's Changed
* Refactor `Task` downloading to use `HuggingFace.datasets` by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/300
* Add templates and update docs by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/308
* Add dataset features to `TriviaQA` by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/305
* Add `SWAG` by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/306
* Fixes for using lm_eval as a library by dirkgr in https://github.com/EleutherAI/lm-evaluation-harness/pull/309
* Researcher2 by researcher2 in https://github.com/EleutherAI/lm-evaluation-harness/pull/261
* Suggested updates for the task guide by StephenHogg in https://github.com/EleutherAI/lm-evaluation-harness/pull/301
* Add pre-commit by Mistobaan in https://github.com/EleutherAI/lm-evaluation-harness/pull/317
* Decontam import fix by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/321
* Add bootstrap_iters kwarg by Muennighoff in https://github.com/EleutherAI/lm-evaluation-harness/pull/322
* Update decontamination.md by researcher2 in https://github.com/EleutherAI/lm-evaluation-harness/pull/331
* Fix key access in squad evaluation metrics by konstantinschulz in https://github.com/EleutherAI/lm-evaluation-harness/pull/333
* Fix make_disjoint_window for tail case by richhankins in https://github.com/EleutherAI/lm-evaluation-harness/pull/336
* Manually concat tokenizer revision with subfolder by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/343
* [deps] Use minimum versioning for `numexpr` by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/352
* Remove custom datasets that are in HF by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/330
* Add `TextSynth` API by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/299
* Add the original `LAMBADA` dataset by jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/357

New Contributors
* dirkgr made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/309
* Mistobaan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/317
* konstantinschulz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/333
* richhankins made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/336

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.2.0...v0.3.0

0.2.0

0.1.0

- added blimp (237)
- added qasper (264)
- added asdiv (244)
- added truthfulqa (219)
- added gsm (260)
- implemented description dict and deprecated provide_description (226)
- new `--check_integrity` flag to run integrity unit tests at eval time (290)
- positional arguments to `evaluate` and `simple_evaluate` are now deprecated (see the keyword-argument sketch after this list)
- `_CITATION` attribute on task modules (292)
- lots of bug fixes and task fixes (always remember to report task versions for comparability!)
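
Taken together, the last few items mean the evaluator entry points are best called with keyword arguments. A minimal sketch, assuming the 0.2.x-era parameter names (`description_dict`, `check_integrity`) and illustrative model/task names:

from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",                      # HuggingFace causal LM backend of this era
    model_args="pretrained=gpt2",
    tasks=["blimp_anaphor_gender_agreement", "asdiv"],
    num_fewshot=0,
    # per-task natural-language descriptions, replacing provide_description
    description_dict={"asdiv": "Solve each arithmetic word problem."},
    check_integrity=True,              # run the tasks' integrity unit tests before evaluating
)

The same integrity check is available on the command line via the `--check_integrity` flag.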
