Main changes
* Task "input"/"output" fields renamed to "input_fields" and "reference_fields" to better reflect their meaning, and the type of each field is now defined by Python class names rather than strings (str vs "str"). See an example of the new syntax here:
https://www.unitxt.ai/en/latest/docs/adding_task.html (the old syntax is still allowed)
* Ability to create an ensemble of judges. See the example in https://www.unitxt.ai/en/latest/docs/examples.html#evaluate-using-ensemble-of-llm-as-a-judge-metrics
* Optimized the Rouge and Meteor metrics to run faster; they now report confidence intervals by default. This causes very small variances in scores (well within the confidence interval).
* Added the ability to select demonstrations that depend on the specific instance (and not only at random). See the example in https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_demo_selections.py . This alters the selection of random demos due to seed changes, but should not have any aggregate effect beyond random fluctuations.
* For LLM as a judge metrics, the input sent to the judge is now displayed in the score field called 'judge_raw_input'
* Support for arena hard benchmark. See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py
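
The move from string type names to Python types in the first item above can be illustrated with a small sketch. It does not import unitxt: the field names, the `TYPE_NAMES` table, and the `migrate_field_types` helper are hypothetical and only show the difference between `"str"` and `str`. For the actual task syntax, see the linked documentation.

```python
from typing import List

# Hypothetical lookup from old string type names to the Python types they name.
TYPE_NAMES = {"str": str, "int": int, "float": float, "List[str]": List[str]}


def migrate_field_types(fields):
    """Return a copy of `fields` with string type names replaced by Python types."""
    return {
        name: TYPE_NAMES[declared] if isinstance(declared, str) else declared
        for name, declared in fields.items()
    }


# Old style: each field type is written as a string.
old_input_fields = {"question": "str", "choices": "List[str]"}
old_reference_fields = {"answer": "str"}

# New style: the same fields under the new names, typed with Python classes.
new_task_fields = {
    "input_fields": migrate_field_types(old_input_fields),
    "reference_fields": migrate_field_types(old_reference_fields),
}
```

Values that are already Python types pass through the helper unchanged, so it can be applied to a mix of old and new declarations.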
Non backward compatible changes
* Changed template method names to use "input_fields" and "reference_fields" (affects only people who wrote custom template code) by yoavkatz in https://github.com/IBM/unitxt/pull/1030
* Refactor Rouge and Meteor to InstanceMetric for faster score computation - this causes very small variances in scores (well within the confidence interval) by yoavkatz in https://github.com/IBM/unitxt/pull/1011
* Ability to create demo samplers based on instance (this causes changes in random selection of demos in normal mode) by yoavkatz in https://github.com/IBM/unitxt/pull/1034
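
To make the difference between random and instance-dependent demo selection concrete, here is a minimal sketch. It is not the unitxt sampler API: `random_demos`, `closest_demos`, and the word-overlap criterion are assumptions chosen for illustration only. The linked example script shows the real usage.

```python
import random


def random_demos(pool, k, seed=0):
    """Pick k demos at random, ignoring the instance (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


def closest_demos(instance, pool, k):
    """Pick the k demos that share the most words with the instance text."""
    words = set(instance["text"].lower().split())

    def overlap(demo):
        return len(words & set(demo["text"].lower().split()))

    # sorted() is stable, so ties keep their original pool order.
    return sorted(pool, key=overlap, reverse=True)[:k]


pool = [
    {"text": "the cat sat on the mat", "label": "animal"},
    {"text": "stocks fell sharply today", "label": "finance"},
    {"text": "a dog chased the cat", "label": "animal"},
    {"text": "the bank raised interest rates", "label": "finance"},
]
instance = {"text": "the cat chased a mouse"}

selected = closest_demos(instance, pool, k=2)
baseline = random_demos(pool, k=2)
```

Because the random path depends on a seed, any change to how seeds are derived changes which demos are drawn, which is consistent with the note above about random selection changing in normal mode.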
Changes in Catalog
* The safety and regard metrics became instance metrics and are now named SafetyMetric and RegardMetric by dafnapension in https://github.com/IBM/unitxt/pull/1004
* Remove financebench card since it was removed from HF by elronbandel in https://github.com/IBM/unitxt/pull/1016
* add validation to tldr, remove shuffle from billsum by alonh in https://github.com/IBM/unitxt/pull/1038
* Fix typo in japanese_llama system prompt (issue 964) by bnayahu in https://github.com/IBM/unitxt/pull/1056
* numeric nlg dataset template changes by ShirApp in https://github.com/IBM/unitxt/pull/1041
Additions to catalog
* Arena hard elad2 by eladven and OfirArviv in https://github.com/IBM/unitxt/pull/1026
* Add flores101 by perlitz in https://github.com/IBM/unitxt/pull/1053
* Add metric "metrics.rag.retrieval_at_k" to catalog by matanor in https://github.com/IBM/unitxt/pull/1074
* Add Finqa dataset by ShirApp in https://github.com/IBM/unitxt/pull/962
* Allow rag context_id fields to be List[str] and not only List[int] by perlitz in https://github.com/IBM/unitxt/pull/1036
* Rag end to end task support (in progress) by benjaminsznajder in https://github.com/IBM/unitxt/pull/1044, https://github.com/IBM/unitxt/pull/1080
New Features
* Rename task "input"/"output" fields to "input_fields" and "reference_fields" by luisaadanttas in https://github.com/IBM/unitxt/pull/994
* Support for ensemble metrics by eladven in https://github.com/IBM/unitxt/pull/1047
* Additional inference parameters for openai and genai and simplified InferenceEngine API param passing by pawelknes in https://github.com/IBM/unitxt/pull/1019 and https://github.com/IBM/unitxt/pull/1024
* Real types in tasks and metrics by elronbandel in https://github.com/IBM/unitxt/pull/1045
* Ability to create demo samplers based on instance by yoavkatz in https://github.com/IBM/unitxt/pull/1034
* add judge input to the LLM as Judge metric scores by OfirArviv in https://github.com/IBM/unitxt/pull/1064
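
As a rough illustration of what an ensemble of judge metrics does, the sketch below combines per-judge scores with a weighted average. The `ensemble_score` function, the judge names, and the averaging rule are assumptions for illustration and are not necessarily how unitxt combines judges. See the ensemble example linked under "Main changes" for the supported usage.

```python
def ensemble_score(judge_scores, weights=None):
    """Combine per-judge scores into one score using a weighted average."""
    if weights is None:
        # Assumption: with no weights given, every judge counts equally.
        weights = [1.0] * len(judge_scores)
    if len(weights) != len(judge_scores):
        raise ValueError("weights and judge_scores must have the same length")
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(judge_scores, weights)) / total_weight


# Hypothetical scores that three judges assigned to the same prediction.
scores = {"judge_a": 0.8, "judge_b": 0.6, "judge_c": 1.0}

equal_weighted = ensemble_score(list(scores.values()))
favoring_judge_c = ensemble_score(list(scores.values()), weights=[1, 1, 2])
```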
Bug Fixes
* Solve problem with stripping format in the LLM as a judge code by eladven in https://github.com/IBM/unitxt/pull/1005
* Added seed to LLM as judges for consistent results by yoavkatz in https://github.com/IBM/unitxt/pull/1029
* Fixed issues with fresh install by yoavkatz in https://github.com/IBM/unitxt/pull/1037
* WML Inference Engine fix by pawelknes in https://github.com/IBM/unitxt/pull/1013
* replace type and __type__ in type error message by perlitz in https://github.com/IBM/unitxt/pull/1035
* FinQA - filter problematic examples by ShirApp in https://github.com/IBM/unitxt/pull/1039
* demo's target prefix is now taken from demo instance by dafnapension in https://github.com/IBM/unitxt/pull/1031
* Make sure preparation times printed fully and nicely by elronbandel in https://github.com/IBM/unitxt/pull/1046
* Added prediction type to LLM as judge to avoid warning by yoavkatz in https://github.com/IBM/unitxt/pull/1072
* Fixed confidence interval inconsistency when some metrics compute ci and some do not by dafnapension in https://github.com/IBM/unitxt/pull/1065
* Fix bug in data classes and add support for field overriding in fields containing types or functions by elronbandel in https://github.com/IBM/unitxt/pull/1027
* Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without defining FMEVAL_COS_URL by eladven in https://github.com/IBM/unitxt/pull/1021
* Added check of type of format and system prompt to LLM as judge by yoavkatz in https://github.com/IBM/unitxt/pull/1068
* Allow assigning None in overwrites when fetching artifacts with modifications by dafnapension in https://github.com/IBM/unitxt/pull/1062
* fix - building test is not working. Updated Kaggle version. by benjaminsznajder in https://github.com/IBM/unitxt/pull/1055
Documentation changes
* Update error message and documentation on unitxt local and HF version conflict by yoavkatz in https://github.com/IBM/unitxt/pull/995
* Update llm_as_judge.rst by yoavkatz in https://github.com/IBM/unitxt/pull/1085
* Update introduction.rst add the word "a" before "variety" by welisheva22 in https://github.com/IBM/unitxt/pull/1015
* Example improvements by yoavkatz in https://github.com/IBM/unitxt/pull/1022
* Add a guide for using unitxt with lm-evaluation-harness by elronbandel in https://github.com/IBM/unitxt/pull/1020
* Fix some docs titles and links by elronbandel in https://github.com/IBM/unitxt/pull/1023
* Add example of meta evaluation of llm as judge by yoavkatz in https://github.com/IBM/unitxt/pull/1025
* Update introduction.rst - copy edits (grammar, consistency, clarity) by welisheva22 in https://github.com/IBM/unitxt/pull/1063
* Added example for selection of demos by yoavkatz in https://github.com/IBM/unitxt/pull/1052
-----
New Contributors
We want to thank the new contributors for their first contributions!
* welisheva22 made their first contribution in https://github.com/IBM/unitxt/pull/1015
* luisaadanttas made their first contribution in https://github.com/IBM/unitxt/pull/994
* benjaminsznajder made their first contribution in https://github.com/IBM/unitxt/pull/1055
* hanansinger made their first contribution in https://github.com/IBM/unitxt/pull/1057