Lighteval

Latest version: v0.8.1


0.8.0

What's new

Tasks
* [LiveCodeBench](https://livecodebench.github.io/) by plaguss in #548, 587, 518
* [GPQA diamond](https://arxiv.org/abs/2311.12022) by lewtun in #534
* [Humanity's last exam](https://agi.safe.ai/) by clefourrier in #520
* [Olympiad Bench](https://github.com/OpenBMB/OlympiadBench) by NathanHB in #521
* [aime24, 25](https://aime25.aimedicine.info/) and [math500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) by NathanHB in #586
* French models evals by mdiazmel in #505

Metrics
* Pass@k by clefourrier in 519
* Extractive Match metric by hynky1999 in 495, 503, 522, 535
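Pass@k is typically computed with the unbiased estimator from the Codex paper: the probability that at least one of k samples drawn from n generations is correct, given that c of them are correct. A minimal sketch of that formula, for illustration only (not necessarily lighteval's exact implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 generations of which 1 is correct, pass@1 is 0.5.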

Features
Better logging
* log model config by NathanHB in 627
* Support custom results/details push to hub by albertvillanova in 457
* Push details without converting fields to str by NathanHB in 572
Inference providers
* adds inference providers support by NathanHB in 616
Load details to be evaluated
* Implemented the possibility to load predictions from details files and continue evaluating from there by JoelNiklaus in 488
sglang support
* sglang by Jayon02 in 552

Bug Fixes and Refacto
* Tiny improvements to `endpoint_model.py`, `base_model.py`,... by sadra-barikbin in 219
* Update README.md by NathanHB in 486
* Fix issue with encodings for together models. by JoelNiklaus in 483
* Made litellm judge backend more robust. by JoelNiklaus in 485
* Fix `T_co` import bug by gucci-j in 484
* fix README link by vxw3t8fhjsdkghvbdifuk in 500
* Fixed issue with o1 in litellm. by JoelNiklaus in 493
* Hotfix for litellm judge by JoelNiklaus in 490
* Made judge response processing more robust. by JoelNiklaus in 491
* VLLM: Allows for max tokens to be set in model config file by NathanHB in 547
* Bump up the latex2sympy2_extended version + more tests by hynky1999 in 510
* Fixed bug of import url_to_fs from fsspec by LoserCheems in 507
* Fix Ukrainian indices and confirmation word by ayukh in 516
* Fix VLLM data-parallel by hynky1999 in 541
* relax spacy import to relax dep by clefourrier in 622
* vllm fix sampling params by NathanHB in 625
* relax deps for tgi by NathanHB in 626
* Bug fix extractive match by hynky1999 in 540
* Fix loading of vllm model from files by NathanHB in 533
* fix: broken URLs by deep-diver in 550
* typo(vllm): `gpu_memory_utilisation` typo by tpoisonooo in 553
* allows better flexibility for litellm endpoints by NathanHB in 549
* Translate task template to Catalan and Galician and fix typos by mariagrandury in 506
* Relax upper bound on torch by lewtun in 508
* Fix vLLM generation with sampling params by lewtun in 578
* Make BLEURT lazy by hynky1999 in 536
* Fixing backend error in main_sglang. by TankNee in 597
* VLLM + Math-Verify fixes by hynky1999 in 603
* raise exception when generation size is more than model length by NathanHB in 571
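The last fix above fails fast when a requested generation cannot fit in the model's context window. An illustrative version of such a guard (hypothetical names, not lighteval's actual code):

```python
def check_generation_fits(prompt_len: int, max_new_tokens: int,
                          model_max_len: int) -> None:
    """Raise instead of silently truncating when the prompt plus the
    requested generation exceeds the model's context length."""
    if prompt_len + max_new_tokens > model_max_len:
        raise ValueError(
            f"prompt ({prompt_len} tokens) + generation ({max_new_tokens}) "
            f"exceeds the model's context length ({model_max_len})"
        )
```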

Thanks
Huge thanks to Hynek, Lewis, Ben, Agustín, Elie and everyone helping and giving feedback 💙

Significant community contributions
The following contributors have made significant changes to the library over the last release:
* hynky1999
* Extractive Match metric (495)
* Fix math extraction (503)
* Bump up the latex2sympy2_extended version + more tests (510)
* Math extraction - allow only trying the first match, more customizable latex extraction + bump deps (522)
* add missing inits (524)
* Sync Math-verify (535)
* Make BLEURT lazy (536)
* Bug fix extractive match (540)
* Fix VLLM data-parallel (541)
* VLLM + Math-Verify fixes (603)
* plaguss
* Add extended task for LiveCodeBench codegeneration (548)
* Add subsets for lcb (587)
* Jayon02
* Let lighteval support sglang (552)
* NathanHB
* adds olympiad bench (521)
* Fix loading of vllm model from files (533)
* [VLLM] Allows for max tokens to be set in model config file (547)
* allows better flexibility for litellm endpoints (549)
* raise exception when generation size is more than model length (571)
* Push details without converting fields to str (572)
* adds aime24, 25 and math500 (586)
* adds inference providers support (616)
* vllm fix sampling params (625)
* relax deps for tgi (626)
* log model config (627)

0.7.0

What's New
New Tasks
* added musr by clefourrier in 375
* Adds Global MLMU by hynky1999 in 426
* Add new Arabic benchmarks (5) and enhance existing tasks by alielfilali01 in 372

New Features
* Evaluate a model already loaded in memory for training / evaluation loop by clefourrier in 390
* Allowing a single prompt to use several formats for one eval by clefourrier in 398
* Autoscaling inference endpoints hardware by clefourrier in 412
* CLI new look and features (using typer) by NathanHB in 407
* Better Looking and more functional logging by NathanHB in 415
* Add litellm backend by JoelNiklaus in 385

More Translation Literals by the Community
* add bashkir variants by AigizK in 374
* add Shan (shn) translation literals by NoerNova in 376
* Add Udmurt (udm) translation literals by codemurt in 381
* This PR adds translation literals for Belarusian language. by Kryuski in 382
* added tatar literals by gaydmi in 383

New Doc
* Add doc-builder doc-pr-upload GH Action by albertvillanova in 411
* Set up docs by albertvillanova in 403
* Add docstring docs by albertvillanova in 413
* Add missing models to docs by albertvillanova in 419
* Update docs about inference endpoints by albertvillanova in 432
* Upgrade deprecated GH Action cachev2 by albertvillanova in 456
* Add EvaluationTracker to docs and fix its docstring by albertvillanova in 464
* Checkout PR merge commit for CI tests by albertvillanova in 468

Bug Fixes and Refacto
* Allow AdapterModels to have custom tokens by mapmeld in 306
* Homogeneize generation params by clefourrier in 428
* fix: cache directory variable by NazimHAli in 378
* Add trufflehog secrets detection by albertvillanova in 429
* greedy_until() fix by vsabolcec in 344
* Fixes a TypeError for generative metrics. by JoelNiklaus in 386
* Speed up Bootstrapping Computation by JoelNiklaus in 409
* Fix imports from model_config by albertvillanova in 443
* Fix wrong instructions and code for custom tasks by albertvillanova in 450
* Fix minor typos by albertvillanova in 449
* fix model parallel by NathanHB in 481
* add configs with their models by clefourrier in 421
* Fixes a TypeError in Sacrebleu. by JoelNiklaus in 387
* fix ukr/rus by hynky1999 in 394
* fix repeated cleanup by anton-l in 399
* Update instance type/size in endpoint model_config example by albertvillanova in 401
* Considering the case empty request list is given to base model by sadra-barikbin in 250
* Fix a tiny bug in `PromptManager::FewShotSampler::_init_fewshot_sampling_random` by sadra-barikbin in 423
* Fix splitting for generative tasks by NathanHB in 400
* Fixes an error with getting the golds from the formatted_docs. by JoelNiklaus in 388
* Fix ignored reuse_existing in config file by albertvillanova in 431
* Deprecate Obsolete Config Properties by ParagEkbote in 433
* fix: LightevalTaskConfig.stop_sequence attribute by ryan-minato in 463
* fix: scorer attribute initialization in ROUGE by ryan-minato in 471
* Delete endpoint on InferenceEndpointTimeoutError by albertvillanova in 475
* Remove unnecessary deepcopy in evaluation_tracker by albertvillanova in 459
* fix: CACHE_DIR Default Value in Accelerate Pipeline by ryan-minato in 461
* Fix warning about precedence of custom tasks over default ones in registry by albertvillanova in 466
* Implement TGI model config from path by albertvillanova in 448


Significant community contributions

The following contributors have made significant changes to the library over the last release:

* clefourrier
* added musr (375)
* Update README.md
* Use the programmatic interface using an already in memory loaded model (390)
* Pr sadra (393)
* Allowing a single prompt to use several formats for one eval (398)
* Autoscaling inference endpoints (412)
* add configs with their models (421)
* Fix custom arabic tasks (440)
* Adds serverless endpoints back (445)
* Homogeneize generation params (428)
* JoelNiklaus
* Fixes a TypeError for generative metrics. (386)
* Fixes a TypeError in Sacrebleu. (387)
* Fixes an error with getting the golds from the formatted_docs. (388)
* Speed up Bootstrapping Computation (409)
* Add litellm inference (385)
* albertvillanova
* Update instance type/size in endpoint model_config example (401)
* Typo in feature-request.md (406)
* Add doc-builder doc-pr-upload GH Action (411)
* Set up docs (403)
* Add docstring docs (413)
* Add missing models to docs (419)
* Add trufflehog secrets detection (429)
* Update docs about inference endpoints (432)
* Fix ignored reuse_existing in config file (431)
* Test inference endpoint model config parsing from path (434)
* Fix imports from model_config (443)
* Fix wrong instructions and code for custom tasks (450)
* Fix minor typos (449)
* Implement TGI model config from path (448)
* Upgrade deprecated GH Action cachev2 (456)
* Add EvaluationTracker to docs and fix its docstring (464)
* Remove unnecessary deepcopy in evaluation_tracker (459)
* Fix warning about precedence of custom tasks over default ones in registry (466)
* Checkout PR merge commit for CI tests (468)
* Delete endpoint on InferenceEndpointTimeoutError (475)
* NathanHB
* Fix splitting for generative tasks (400)
* Nathan refacto cli (407)
* redo logging (415)
* option to list custom tasks (425)
* fix model parallel (481)
* ParagEkbote
* Deprecate Obsolete Config Properties (433)
* alielfilali01
* Add new Arabic benchmarks (5) and enhance existing tasks (372)
* Update arabic_evals.py: Fix custom arabic tasks [2nd attempt] (444)

0.6.0

What's New

Lighteval becomes massively multilingual!
We now have extensive coverage in many languages, as well as new templates to manage multilinguality more easily.

* Add 3 NLI tasks supporting 26 unique languages. 329 by hynky1999
* [xnli](https://aclanthology.org/D18-1269/)
* [xnli2.0](https://arxiv.org/abs/2301.06527)
* [indic_xnli](https://arxiv.org/abs/2204.08776)
* [cmnli + ocnli](https://arxiv.org/abs/2004.05986)
* [rcb](https://arxiv.org/abs/2401.04531)

* Add 3 COPA tasks supporting about 20 unique languages. 330 by hynky1999
* [xcopa](https://aclanthology.org/2020.emnlp-main.185/)
* [indic-copa](https://arxiv.org/pdf/2212.05409)
* [parus](https://russiansuperglue.com/tasks/task_info/PARus)

* Add Hellaswag tasks supporting about 36 unique languages. 332 by hynky1999
* [mlmm_hellaswag](https://github.com/nlp-uoregon/mlmm-evaluation)
* hellaswag_{tha/tur}

* Add RC tasks supporting about 130 unique languages/scripts. 333 by hynky1999
* [xquad](https://arxiv.org/abs/1910.11856)
* [thaiqa]()
* [sber_squad](https://arxiv.org/abs/1912.09723)
* [arcd](https://arxiv.org/pdf/1906.05394)
* [kenswquad](https://arxiv.org/abs/2205.02364)
* [chinese_squad](https://github.com/pluto-junzeng/ChineseSquad)
* [cmrc2018](https://arxiv.org/abs/1810.07366)
* [indicqa](https://arxiv.org/abs/2407.13522)
* [fquad_v2](https://arxiv.org/abs/2002.06071)
* [tydiqa](https://arxiv.org/abs/2003.05002)
* [belebele](https://arxiv.org/abs/2308.16884)

* Add GK tasks supporting about 35 unique languages/scripts. 338 by hynky1999
* meta_mmlu
* mlmm_mmlu
* rummlu
* mmlu_ara_mcf
* tur_leaderboard_mmlu
* cmmlu
* mmlu
* ceval
* mlmm_arc_challenge
* alghafa_arc_easy
* community_arc
* community_truthfulqa
* exams
* m3exams
* thai_exams
* xcsqa
* alghafa_piqa
* mera_openbookqa
* alghafa_openbookqa
* alghafa_sciqa
* mathlogic_qa
* agieval
* mera_worldtree

* Misc Tasks 339 by hynky1999
- openai_mmlu_tasks
- turkish_mmlu_tasks
- lumi arc
- hindi/swahili/arabic (from alghafa) arc
- cmath
- mgsm
- xcodah
- xstory
- xwinograd + tr winograd
- mlqa
- mkqa
- mintaka
- mlqa_tasks
- french triviaqa
- chegeka
- acva
- french_boolq
- hindi_boolq
* Serbian LLM Benchmark Task by DeanChugall in 340
* iroko bench by hynky1999 in 357

Other Tasks
* MixEval Task by NathanHB in 337

Features
* Now Evaluate OpenAI models by NathanHB in 359
* New Doc and README by NathanHB in 327
* Refacto LLM as A Judge by NathanHB in 337
* Selecting tasks using their superset by hynky1999 in 308
* Nicer output on task search failure by hynky1999 in 357
* Adds tasks templating by hynky1999 in 335
* Support for multilingual generative metrics by hynky1999 in 293
* Class implementations of faithfulness and extractiveness metrics by chuandudx in 323
* Translation literals by hynky1999 in 356
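For the extractiveness metric mentioned above, the core idea is measuring how much of a summary is copied from its source. Real extractiveness metrics use greedy extractive fragments; this simplified token-overlap proxy is for illustration only:

```python
def extractive_coverage(source: str, summary: str) -> float:
    """Rough extractiveness proxy: the fraction of summary tokens that
    also appear in the source text (a simplification of fragment-based
    extractiveness metrics)."""
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    hits = sum(tok in source_tokens for tok in summary_tokens)
    return hits / len(summary_tokens)
```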

Bug Fixes
* Math normalization: do not crash on invalid format by guipenedo in 331
* Skipping push to hub test by clefourrier in 334
* Fix Metrics import path in community task template file. by chuandudx in 309
* Allow kwargs for BERTScore compute function and remove unused var by chuandudx in 311
* Fixes sampling for vllm when num_samples==1 by edbeeching in 343
* Fix the dataset loading for custom tasks by clefourrier in 364
* Fix: missing property tag in inference endpoints by clefourrier in 368
* Fix Tokenization + misc fixes by hynky1999 in 354
* Fix BLEURT evaluation errors by chuandudx in 316
* Adds Baseline workflow + fixes by hynky1999 in 363

Significant community contributions

The following contributors have made significant changes to the library over the last release:

* hynky1999
* Support for multilingual generative metrics (293)
* Adds tasks templating (335)
* Multilingual NLI Tasks (329)
* Multilingual COPA tasks (330)
* Multilingual Hellaswag tasks (332)
* Multilingual Reading Comprehension tasks (333)
* Multilingual General Knowledge tasks (338)
* Selecting tasks using their superset (308)
* Fix Tokenization + misc fixes (354)
* Misc-multilingual tasks (339)
* add iroko bench + nicer output on task search failure (357)
* Translation literals (356)
* selected tasks for multilingual evaluation (371)
* Adds Baseline workflow + fixes (363)
* DeanChugall
* Serbian LLM Benchmark Task (340)
* NathanHB
* readme rewrite (327)
* refacto judge and add mixeval (337)
* bump lighteval version (328)
* fix (347)
* Nathan llm judge quickfix (348)
* Nathan llm judge quickfix (350)
* adds openai models (359)

New Contributors
* chuandudx made their first contribution in https://github.com/huggingface/lighteval/pull/323
* edbeeching made their first contribution in https://github.com/huggingface/lighteval/pull/343
* DeanChugall made their first contribution in https://github.com/huggingface/lighteval/pull/340
* Stopwolf made their first contribution in https://github.com/huggingface/lighteval/pull/225
* martinscooper made their first contribution in https://github.com/huggingface/lighteval/pull/366

**Full Changelog**: https://github.com/huggingface/lighteval/compare/v0.5.0...v0.6.0

0.5.0

What's new

Features
* Tokenization-wise encoding by hynky1999 in 287
* Task config by hynky1999 in 289

Bug fixes
* Fixes bug: You can't create a model without either a list of model_args or a model_config_path when model_config_path was submitted by NathanHB in 298
* skip tests if secrets not provided by hynky1999 in 304
* [FIX] vllm backend by NathanHB in 317

0.4.0

What's new

Features
* Adds vllm as backend for insane speed up by NathanHB in 274
* Add llm_as_judge in metrics (using both OpenAI or Transformers) by NathanHB in 146
* Able to use config files for models by clefourrier in 131
* List available tasks in the cli `lighteval tasks --list` by DimbyTa in 142
* Use torch compile for speed up by clefourrier in 248
* Add maj@k metric by clefourrier in 158
* Adds a dummy/random model for baseline init by guipenedo in 220
* lighteval is now a cli tool: `lighteval --args` by NathanHB in 152
* We can now log info from the metrics (for example input and response from llm_as_judge) by NathanHB in 157
* Configurable task versioning by PhilipMay in 181
* Programmatic interface by clefourrier in 269
* Probability Metric + New Normalization by hynky1999 in 276
* Add widgets to the README by clefourrier in 145
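The probability metric with new normalization listed above commonly means length-normalized loglikelihood for multiple choice: pick the continuation with the highest log-probability per character, so longer answers are not penalized. A sketch under that common definition (not necessarily lighteval's exact code):

```python
def pick_choice(logprobs, continuations):
    """Length-normalized multiple choice: select the index of the
    continuation with the highest total log-probability per character."""
    scores = [lp / len(cont) for lp, cont in zip(logprobs, continuations)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With raw loglikelihoods, `pick_choice([-9.0, -2.0], ["abc", "ab"])` normalizes to -3.0 and -1.0 per character and selects the second choice.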

New tasks
* Add `Ger-RAG-eval` tasks by PhilipMay in 149
* adding `aimo` custom eval by NathanHB in 154

Fixes
* Bump nltk to 3.9.1 to fix security issue by NathanHB in 137
* Fix max_length type when being passed in model args by csarron in 138
* Fix nanotron models input size bug by clefourrier in 156
* Fix MATH normalization by lewtun in 162
* fix Prompt function names by clefourrier in 168
* Fix prompt format german rag community task by jphme in 171
* add 'cite as' section in readme by NathanHB in 178
* Fix broken link to extended tasks in README by alexrs in 182
* Mention HF_TOKEN in readme by Wauplin in 194
* Download BERT scorer lazily by sadra-barikbin in 190
* Updated tgi_model and added parameters for endpoint_model by shaltielshmid in 208
* fix llm as judge warnings by NathanHB in 173
* ADD GPT-4 as Judge by philschmid in 206
* Fix a few typos and do a tiny refactor by sadra-barikbin in 187
* Avoid truncating the outputs based on string lengths by anton-l in 201
* Now only uses functions for prompt definition by clefourrier in 213
* Data split depending on eval params by clefourrier in 169
* should fix most inference endpoints issues of version config by clefourrier in 226
* Fix _init_max_length in base_model.py by gucci-j in 185
* Make evaluator invariant of input request type order by sadra-barikbin in 215
* Fixing issues with multichoice_continuations_start_space - was not parsed properly by clefourrier in 232
* Fix IFEval metric by lewtun in 259
* change priority when choosing model dtype by NathanHB in 263
* Add grammar option to generation by sadra-barikbin in 242
* make info loggers dataclass, so that their properties have expected lifetime by hynky1999 in 280
* Remove expensive prediction run during test collection by hynky1999 in 279
* Example Configs and Docs by RohitMidha23 in 255
* Refactoring the few shot management by clefourrier in 272
* Standalone nanotron config by hynky1999 in 285
* Logging Revamp by hynky1999 in 284
* bump nltk version by NathanHB in 290

Significant community contributions

The following contributors have made significant changes to the library over the last release:

* NathanHB
* commit (137)
* Add llm as judge in metrics (146)
* Nathan add logging to metrics (157)
* add 'cite as' section in readme (178)
* Fix citation section in readme (180)
* adding aimo custom eval (154)
* fix llm as judge warnings (173)
* launch lighteval using `lighteval --args` (152)
* adds llm as judge using transformers (223)
* Fix missing json file (264)
* change priority when choosing model dtype (263)
* fix the location of tasks list in the readme (267)
* updates ifeval repo (268)
* fix nanotron (283)
* add vllm backend (274)
* bump nltk version (290)
* clefourrier
* Add config files for models (131)
* Add fun widgets to the README (145)
* Fix nanotron models input size bug (156)
* no function we actually use should be named prompt_fn (168)
* Add maj@k metric (158)
* Homogeneize logging system (150)
* Use only dataclasses for task init (212)
* Now only uses functions for prompt definition (213)
* Data split depending on eval params (169)
* should fix most inference endpoints issues of version config (226)
* Add metrics as functions (214)
* Quantization related issues (224)
* Update issue templates (235)
* remove latex writer since we don't use it (231)
* Removes default bert scorer init (234)
* fix (233)
* updated piqa (222)
* uses torch compile if provided (248)
* Fix inference endpoint config (244)
* Expose samples via the CLI (228)
* Fixing issues with multichoice_continuations_start_space - was not parsed properly (232)
* Programmatic interface + cleaner management of requests (269)
* Small file reorg (only renames/moves) (271)
* Refactoring the few shot management (272)
* PhilipMay
* Add `Ger-RAG-eval` tasks (149)
* Add version config option. (181)
* shaltielshmid
* Added Namespace parameter for InferenceEndpoints, added option for passing model config directly (147)
* Updated tgi_model and added parameters for endpoint_model (208)
* hynky1999
* make info loggers dataclass, so that their properties have expected lifetime (280)
* Remove expensive prediction run during test collection (279)
* Probability Metric + New Normalization (276)
* Standalone nanotron config (285)
* Logging Revamp (284)

0.3.0

Release Note

This release introduces the new extended tasks feature, documentation, and many other patches for improved stability.
New tasks are also introduced:
- Big Bench Hard: https://huggingface.co/papers/2210.09261
- AGIEval: https://huggingface.co/papers/2304.06364
- TinyBench:
- MT Bench: https://huggingface.co/papers/2306.05685
- AlGhafa Benchmarking Suite: https://aclanthology.org/2023.arabicnlp-1.21/

MT-Bench marks the introduction of multi-turn prompting as well as the llm-as-a-judge metric.
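Llm-as-a-judge scores free-form answers by prompting a (usually stronger) model to grade them and parsing a rating out of its reply. A minimal illustrative flow, where `ask_judge` is a hypothetical stub rather than lighteval's API:

```python
def judge_score(question: str, answer: str, ask_judge):
    """Build a grading prompt, send it to a judge model via `ask_judge`,
    and parse the first integer rating in the reply (None if absent)."""
    prompt = (
        "Rate the following answer on a scale of 1 to 10.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with the rating only."
    )
    reply = ask_judge(prompt)
    for token in reply.split():
        if token.strip(".").isdigit():
            return int(token.strip("."))
    return None
```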

New tasks
* Add BBH by clefourrier in https://github.com/huggingface/lighteval/pull/7, bilgehanertan in https://github.com/huggingface/lighteval/pull/126
* Add AGIEval by clefourrier in https://github.com/huggingface/lighteval/pull/121
* Adding TinyBench by clefourrier in https://github.com/huggingface/lighteval/pull/104
* Adding support for Arabic benchmarks : AlGhafa benchmarking suite by alielfilali01 in https://github.com/huggingface/lighteval/pull/95
* Add mt-bench by NathanHB in https://github.com/huggingface/lighteval/pull/75

Features
* Extended Tasks ! by clefourrier in https://github.com/huggingface/lighteval/pull/101, lewtun in https://github.com/huggingface/lighteval/pull/108, NathanHB in https://github.com/huggingface/lighteval/pull/122, https://github.com/huggingface/lighteval/pull/123
* Added support for launching inference endpoint with different model dtypes by shaltielshmid in https://github.com/huggingface/lighteval/pull/124

Documentation
* Adding LICENSE by clefourrier in https://github.com/huggingface/lighteval/pull/86, NathanHB in https://github.com/huggingface/lighteval/pull/89
* Make it clearer in the README that the leaderboard uses the harness by clefourrier in https://github.com/huggingface/lighteval/pull/94

Small patches
* Update huggingface-hub for compatibility with datasets 2.18 by clefourrier in https://github.com/huggingface/lighteval/pull/84
* Tidy up dependency groups by lewtun in https://github.com/huggingface/lighteval/pull/81
* bump git python by NathanHB in https://github.com/huggingface/lighteval/pull/90
* Sets a max length for the MATH task by clefourrier in https://github.com/huggingface/lighteval/pull/83
* Fix parallel data processing bug by clefourrier in https://github.com/huggingface/lighteval/pull/92
* Change the eos condition for GSM8K by clefourrier in https://github.com/huggingface/lighteval/pull/85
* Fixing rolling loglikelihood management by clefourrier in https://github.com/huggingface/lighteval/pull/78
* Fixes input length management for generative evals by clefourrier in https://github.com/huggingface/lighteval/pull/103
* Reorder addition of instruction in chat template by clefourrier in https://github.com/huggingface/lighteval/pull/111
* Ensure chat models terminate generation with EOS token by lewtun in https://github.com/huggingface/lighteval/pull/115
* Fix push details to hub by NathanHB in https://github.com/huggingface/lighteval/pull/98
* Small fixes to InferenceEndpointModel by shaltielshmid in https://github.com/huggingface/lighteval/pull/112
* Fix import typo autogptq by clefourrier in https://github.com/huggingface/lighteval/pull/116
* Fixed the loglikelihood method in inference endpoints models by clefourrier in https://github.com/huggingface/lighteval/pull/119
* Fix TextGenerationResponse import from hfh by Wauplin in https://github.com/huggingface/lighteval/pull/129
* Do not use deprecated list_files_info by Wauplin in https://github.com/huggingface/lighteval/pull/133
* Update test workflow name to 'Tests' by Wauplin in https://github.com/huggingface/lighteval/pull/134

New Contributors
* shaltielshmid made their first contribution in https://github.com/huggingface/lighteval/pull/112
* bilgehanertan made their first contribution in https://github.com/huggingface/lighteval/pull/126
* Wauplin made their first contribution in https://github.com/huggingface/lighteval/pull/129

**Full Changelog**: https://github.com/huggingface/lighteval/compare/v0.2.0...v0.3.0
