unitxt Changelog

1.12.0

Main changes

* Task "input"/"output" fields renamed to "input_fields" and "reference_fields" to better reflect their meaning, and the type of each field is now defined by Python class names and not strings (str vs "str"). See an example of the new syntax here:
https://www.unitxt.ai/en/latest/docs/adding_task.html (old syntax still allowed)
* Ability to create an ensemble of judges. See example in https://www.unitxt.ai/en/latest/docs/examples.html#evaluate-using-ensemble-of-llm-as-a-judge-metrics
* Optimized Rouge and Meteor metrics to run faster; they now report confidence intervals by default. This causes very small variances in scores (well within the confidence interval).
* Added ability to select demonstrations that depend on the specific instance (and not only random). See example in https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_demo_selections.py . This change causes some changes in selection of random demos due to seed changes, but should not have any aggregated effect beyond random fluctuations.
* For LLM as Judges, the input sent to the judge is now displayed in the score field called 'judge_raw_input'
* Support for arena hard benchmark. See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py
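
As an illustration of the task field rename in the first item above, here is a minimal sketch of converting an old-style task description into the new form. The helper `migrate_task_spec` and the plain-dict representation are hypothetical and are not part of unitxt; the key names "input" and "output" follow the wording of this changelog and may differ in real task definitions. The linked adding_task page remains the reference for the actual syntax.

```python
# Hypothetical helper, not part of unitxt: illustrates the rename and the
# move from type strings ("str") to Python classes (str).
from typing import Any, Dict

# Assumption: only a few builtin type names are covered here.
_TYPE_NAMES = {"str": str, "int": int, "float": float, "bool": bool}

# Old key -> new key, using the names given in this changelog entry.
_RENAMES = {"input": "input_fields", "output": "reference_fields"}


def _to_python_type(type_spec: Any) -> Any:
    """Turn a known type string into its Python class; leave anything else as-is."""
    if isinstance(type_spec, str):
        return _TYPE_NAMES.get(type_spec, type_spec)
    return type_spec


def migrate_task_spec(old: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `old` with renamed field keys and class-based types."""
    new: Dict[str, Any] = {}
    for key, value in old.items():
        if key in _RENAMES and isinstance(value, dict):
            new[_RENAMES[key]] = {
                name: _to_python_type(type_spec) for name, type_spec in value.items()
            }
        else:
            new[_RENAMES.get(key, key)] = value
    return new


old_spec = {
    "input": {"text": "str"},
    "output": {"label": "str"},
    "metrics": ["metrics.accuracy"],
}
new_spec = migrate_task_spec(old_spec)
```

Unknown type strings are left untouched rather than guessed, so anything outside the small mapping would need manual review.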

Non backward compatible changes
* changed template method names to use "input_fields" and "reference_fields" (affects only people who wrote custom template code) by yoavkatz in https://github.com/IBM/unitxt/pull/1030
* Refactor Rouge and Meteor to InstanceMetric for faster score computation - this causes very small variances in scores (well within the confidence interval) by yoavkatz in https://github.com/IBM/unitxt/pull/1011
* Ability to create demo samplers based on instance (this causes changes in random selection of demos in normal mode) by yoavkatz in https://github.com/IBM/unitxt/pull/1034
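
To make the phrase "well within the confidence interval" concrete, the following is a rough sketch of how a confidence interval can be derived from per-instance scores. It assumes a simple percentile bootstrap over the mean; this is an illustration only and is not claimed to be how unitxt computes its intervals. The function name `bootstrap_ci` and its parameters are made up for this example.

```python
# Illustrative sketch only: percentile bootstrap over per-instance scores.
# Not unitxt's implementation; names and defaults are assumptions.
import random
from statistics import mean


def bootstrap_ci(instance_scores, n_resamples=1000, confidence=0.95, seed=0):
    """Return (low, high) bounds for the mean of `instance_scores`."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(instance_scores)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(instance_scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    low_index = int(alpha * n_resamples)
    high_index = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return resampled_means[low_index], resampled_means[high_index]


scores = [0.2, 0.4, 0.6, 0.8] * 10
low, high = bootstrap_ci(scores)
```

A score shift between releases that stays between `low` and `high` would be indistinguishable from resampling noise under this sketch.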

Changes in Catalog
* safety and regard metrics became instance metrics and named SafetyMetric and RegardMetric by dafnapension in https://github.com/IBM/unitxt/pull/1004
* Remove financebench card since it was removed from HF by elronbandel in https://github.com/IBM/unitxt/pull/1016
* add validation to tldr, remove shuffle from billsum by alonh in https://github.com/IBM/unitxt/pull/1038
* Fix typo in japanese_llama system prompt (issue 964) by bnayahu in https://github.com/IBM/unitxt/pull/1056
* numeric nlg dataset template changes by ShirApp in https://github.com/IBM/unitxt/pull/1041

Additions to catalog

* Arena hard elad2 by eladven and OfirArviv in https://github.com/IBM/unitxt/pull/1026
* Add flores101 by perlitz in https://github.com/IBM/unitxt/pull/1053
* Add metric "metrics.rag.retrieval_at_k" to catalog by matanor in https://github.com/IBM/unitxt/pull/1074
* Add Finqa dataset by ShirApp in https://github.com/IBM/unitxt/pull/962
* Allow rag context_id fields to be List[str] and not only List[int] by perlitz in https://github.com/IBM/unitxt/pull/1036
* Rag end to end task support (in progress) - by benjaminsznajder in https://github.com/IBM/unitxt/pull/1044, https://github.com/IBM/unitxt/pull/1080

New Features
* Rename task "input"/"output" fields to "input_fields" and "reference_fields" by luisaadanttas in https://github.com/IBM/unitxt/pull/994
* Support for ensemble metrics by eladven in https://github.com/IBM/unitxt/pull/1047
* Additional inference parameters for openai and genai and simplified InferenceEngine API param passing by pawelknes in https://github.com/IBM/unitxt/pull/1019 and https://github.com/IBM/unitxt/pull/1024
* Real types in tasks and metrics by elronbandel in https://github.com/IBM/unitxt/pull/1045
* Ability to create demo samplers based on instance by yoavkatz in https://github.com/IBM/unitxt/pull/1034
* add judge input to the LLM as Judge metric scores by OfirArviv in https://github.com/IBM/unitxt/pull/1064

Bug Fixes
* Solve problem with stripping format in LLM as a judge code by eladven in https://github.com/IBM/unitxt/pull/1005
* Added seed to LLM as judges for consistent results by yoavkatz in https://github.com/IBM/unitxt/pull/1029
* Fixed issues with fresh install by yoavkatz in https://github.com/IBM/unitxt/pull/1037
* WML Inference Engine fix by pawelknes in https://github.com/IBM/unitxt/pull/1013
* replace type and __type__ in type error message by perlitz in https://github.com/IBM/unitxt/pull/1035
* FinQA - filter problematic examples by ShirApp in https://github.com/IBM/unitxt/pull/1039
* demo's target prefix is now taken from demo instance by dafnapension in https://github.com/IBM/unitxt/pull/1031
* Make sure preparation times printed fully and nicely by elronbandel in https://github.com/IBM/unitxt/pull/1046
* Added prediction type to llm as judge to avoid warning by yoavkatz in https://github.com/IBM/unitxt/pull/1072
* Fixed confidence interval inconsistency when some metrics compute ci and some do not by dafnapension in https://github.com/IBM/unitxt/pull/1065
* Fix bug in data classes and add support for field overriding in fields containing types or functions by elronbandel in https://github.com/IBM/unitxt/pull/1027
* Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without defining FMEVAL_COS_URL by eladven in https://github.com/IBM/unitxt/pull/1021
* Added check of type of format and system prompt to LLM as judge by yoavkatz in https://github.com/IBM/unitxt/pull/1068
* Allow assigning None in overwrites when fetching artifacts with modifications by dafnapension in https://github.com/IBM/unitxt/pull/1062
* fix - building test is not working. Updated Kaggle version. by benjaminsznajder in https://github.com/IBM/unitxt/pull/1055

Documentation changes
* Update error message and documentation on unitxt local and HF version conflict by yoavkatz in https://github.com/IBM/unitxt/pull/995
* Update llm_as_judge.rst by yoavkatz in https://github.com/IBM/unitxt/pull/1085
* Update introduction.rst add the word "a" before "variety" by welisheva22 in https://github.com/IBM/unitxt/pull/1015
* Example improvements by yoavkatz in https://github.com/IBM/unitxt/pull/1022
* Add a guide for using unitxt with lm-evaluation-harness by elronbandel in https://github.com/IBM/unitxt/pull/1020
* Fix some docs titles and links by elronbandel in https://github.com/IBM/unitxt/pull/1023
* Add example of meta evaluation of llm as judge by yoavkatz in https://github.com/IBM/unitxt/pull/1025
* Update introduction.rst - - copy edits (grammar, consistency, clarity) by welisheva22 in https://github.com/IBM/unitxt/pull/1063
* Added example for selection of demos by yoavkatz in https://github.com/IBM/unitxt/pull/1052

-----

New Contributors

We want to thank the new contributors for their first contributions!

* welisheva22 made their first contribution in https://github.com/IBM/unitxt/pull/1015
* luisaadanttas made their first contribution in https://github.com/IBM/unitxt/pull/994
* benjaminsznajder made their first contribution in https://github.com/IBM/unitxt/pull/1055
* hanansinger made their first contribution in https://github.com/IBM/unitxt/pull/1057

1.11.1

Non backward compatible changes
* The class InputOutputTemplate has the field input_format. This field becomes a required field. It means that templates should explicitly set their value to None if not using it. by elronbandel in https://github.com/IBM/unitxt/pull/982
* fix MRR RAG metric - fix MRR wiring, allow the context_ids to be a list of strings, instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7. added corresponding tests. This change may change the scores of MRR metric. by matanor in
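
For readers unfamiliar with the metric touched by the MRR fix above, here is a minimal, self-contained sketch of mean reciprocal rank where each prediction is a flat list of context id strings. It is a generic textbook formulation, not the unitxt metric code; the function names and the example ids are invented for illustration.

```python
# Generic MRR sketch, not the unitxt implementation. Each prediction is a
# ranked list of context ids (strings); each reference is a list of the
# ids considered relevant for that instance.
from typing import List, Sequence


def reciprocal_rank(predicted_context_ids: Sequence[str],
                    reference_context_ids: Sequence[str]) -> float:
    """Return 1/rank of the first relevant id, or 0.0 if none is retrieved."""
    relevant = set(reference_context_ids)
    for rank, context_id in enumerate(predicted_context_ids, start=1):
        if context_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(predictions: List[Sequence[str]],
                         references: List[Sequence[str]]) -> float:
    """Average the reciprocal rank over all instances."""
    scores = [reciprocal_rank(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0


# The relevant id "doc-2" is ranked second, so the reciprocal rank is 1/2.
example_rr = reciprocal_rank(["doc-7", "doc-2", "doc-9"], ["doc-2"])
```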

New Features
* Add the option to specify the number of processes to use for parallel dataset loading by csrajmohan in https://github.com/IBM/unitxt/pull/974
* Add option for lazy load hf inference engine by elronbandel in https://github.com/IBM/unitxt/pull/980
* Added a format based on Huggingface format by yoavkatz in https://github.com/IBM/unitxt/pull/988

New Assets
* Add code mixing metric, add language identification task, add format for Starling model by arielge in https://github.com/IBM/unitxt/pull/956

Bug Fixes
* Fix llama_3_ibm_genai_generic_template by lga-zurich in https://github.com/IBM/unitxt/pull/978

Documentation
* Add an example that shows how to use LLM as a judge that takes the references into account… by eladven in https://github.com/IBM/unitxt/pull/981
* Improve the examples table documentation by eladven in https://github.com/IBM/unitxt/pull/976

Refactoring
* Delete empty metrics folder by elronbandel in https://github.com/IBM/unitxt/pull/984

Testing and CI/CD
* Add answer correctness tests by matanor in https://github.com/IBM/unitxt/pull/977

New Contributors
* lga-zurich made their first contribution in https://github.com/IBM/unitxt/pull/978

**Full Changelog**: https://github.com/IBM/unitxt/compare/1.10.1...1.10.2

1.11.0

Non backward compatible changes
* The class InputOutputTemplate has the field input_format. This field becomes a required field. It means that templates should explicitly set their value to None if not using it. by elronbandel in https://github.com/IBM/unitxt/pull/982
* fix MRR RAG metric - fix MRR wiring, allow the context_ids to be a list of strings, instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7. added corresponding tests. This change may change the scores of MRR metric. by matanor in

New Features
* Add the option to specify the number of processes to use for parallel dataset loading by csrajmohan in https://github.com/IBM/unitxt/pull/974
* Add option for lazy load hf inference engine by elronbandel in https://github.com/IBM/unitxt/pull/980
* Added a format based on Huggingface format by yoavkatz in https://github.com/IBM/unitxt/pull/988

New Assets
* Add code mixing metric, add language identification task, add format for Starling model by arielge in https://github.com/IBM/unitxt/pull/956

Bug Fixes
* Fix llama_3_ibm_genai_generic_template by lga-zurich in https://github.com/IBM/unitxt/pull/978

Documentation
* Add an example that shows how to use LLM as a judge that takes the references into account… by eladven in https://github.com/IBM/unitxt/pull/981
* Improve the examples table documentation by eladven in https://github.com/IBM/unitxt/pull/976

Refactoring
* Delete empty metrics folder by elronbandel in https://github.com/IBM/unitxt/pull/984

Testing and CI/CD
* Add answer correctness tests by matanor in https://github.com/IBM/unitxt/pull/977

New Contributors
* lga-zurich made their first contribution in https://github.com/IBM/unitxt/pull/978

**Full Changelog**: https://github.com/IBM/unitxt/compare/1.10.1...1.10.2

1.10.3

Non backward compatible changes
* The class InputOutputTemplate has the field input_format. This field becomes a required field. It means that templates should explicitly set their value to None if not using it. by elronbandel in https://github.com/IBM/unitxt/pull/982
* fix MRR RAG metric - fix MRR wiring, allow the context_ids to be a list of strings, instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7. added corresponding tests. This change may change the scores of MRR metric. by matanor in

New Features
* Add the option to specify the number of processes to use for parallel dataset loading by csrajmohan in https://github.com/IBM/unitxt/pull/974
* Add option for lazy load hf inference engine by elronbandel in https://github.com/IBM/unitxt/pull/980
* Added a format based on Huggingface format by yoavkatz in https://github.com/IBM/unitxt/pull/988

New Assets
* Add code mixing metric, add language identification task, add format for Starling model by arielge in https://github.com/IBM/unitxt/pull/956

Bug Fixes
* Fix llama_3_ibm_genai_generic_template by lga-zurich in https://github.com/IBM/unitxt/pull/978

Documentation
* Add an example that shows how to use LLM as a judge that takes the references into account… by eladven in https://github.com/IBM/unitxt/pull/981
* Improve the examples table documentation by eladven in https://github.com/IBM/unitxt/pull/976

Refactoring
* Delete empty metrics folder by elronbandel in https://github.com/IBM/unitxt/pull/984

Testing and CI/CD
* Add answer correctness tests by matanor in https://github.com/IBM/unitxt/pull/977

New Contributors
* lga-zurich made their first contribution in https://github.com/IBM/unitxt/pull/978

**Full Changelog**: https://github.com/IBM/unitxt/compare/1.10.1...1.10.2

1.10.2

Non backward compatible changes
* None - this release is fully compatible with the previous release.

New Features
* added num_proc parameter - Optional integer to specify the number of processes to use for parallel dataset loading by csrajmohan in https://github.com/IBM/unitxt/pull/974
* Add option to lazy load hf inference engine and fix requirements mechanism by elronbandel in https://github.com/IBM/unitxt/pull/980
* Add code mixing metric, add language identification task, add format for Starling model by arielge in https://github.com/IBM/unitxt/pull/956
* Add metrics: domesticated safety and regard by dafnapension in https://github.com/IBM/unitxt/pull/983
* Make input_format required field in InputOutputTemplate by elronbandel in https://github.com/IBM/unitxt/pull/982
* Added a format based on Huggingface format by yoavkatz in https://github.com/IBM/unitxt/pull/988

Bug Fixes
* Fix the error at the examples table by eladven in https://github.com/IBM/unitxt/pull/976
* fix MRR RAG metric - fix MRR wiring, allow the context_ids to be a list of strings, instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7. added corresponding tests. by matanor in https://github.com/IBM/unitxt/pull/969
* Fix llama_3_ibm_genai_generic_template by lga-zurich in https://github.com/IBM/unitxt/pull/978

Documentation
* Add an example that shows how to use LLM as a judge that takes the references into account… by eladven in https://github.com/IBM/unitxt/pull/981

Refactoring
* Delete empty metrics folder by elronbandel in https://github.com/IBM/unitxt/pull/984

Testing and CI/CD
* Add answer correctness tests by matanor in https://github.com/IBM/unitxt/pull/977

New Contributors
* lga-zurich made their first contribution in https://github.com/IBM/unitxt/pull/978

**Full Changelog**: https://github.com/IBM/unitxt/compare/1.10.1...1.10.2

1.10.1

Main Changes

* Continued with major improvements to the documentation including [a new code examples section](https://unitxt.readthedocs.io/en/latest/docs/examples.html) with standalone python code that shows how to perform evaluation, add new datasets, compare formats, use LLM as judges, and more. Cards for datasets from huggingface have detailed [descriptions](https://unitxt.readthedocs.io/en/latest/catalog/catalog.cards.sst2.html). New documentation of [RAG tasks and metrics](https://unitxt.readthedocs.io/en/latest/docs/rag_support.html).
* `load_dataset` can now load cards defined in a python file (and not only in the catalog). See [example](https://github.com/IBM/unitxt/blob/57957fc0e2303cb9a4389a15a8972dfd0ed8bbce/examples/standalone_qa_evaluation.py#L47).
* The evaluation results returned from `evaluate` now include two fields `predictions` and `processed_predictions`. See [example](https://github.com/IBM/unitxt/blob/57957fc0e2303cb9a4389a15a8972dfd0ed8bbce/examples/standalone_qa_evaluation.py#L75).
* The fields can have defaults, so if they are not specified in the card, they get a default value. For example, multi-class classification has `text` as the default `text_type`. See [example](https://unitxt.readthedocs.io/en/latest/catalog/catalog.tasks.classification.multi_class.html).

Non backward compatible changes

**You need to recreate any cards/metrics you added by running the prepare/*/*.py files. You can create all cards simply by running python utils/prepare_all_artifacts.py . This will avoid the __type__ error.**

**The AddFields operator was renamed Set and CopyFields operator was renamed Copy. Note previous code should continue to work, but we renamed all existing code in the unitxt and fm-eval repos.**

* Change Artifact.type to Artifact.__type__ by elronbandel in https://github.com/IBM/unitxt/pull/933
* change CopyFields operators name to Copy by duckling69 in https://github.com/IBM/unitxt/pull/876
* Rename AddFields to Set, a name that represent its role better and concisely by elronbandel in https://github.com/IBM/unitxt/pull/903

New Features
* Allow eager execution by elronbandel in https://github.com/IBM/unitxt/pull/888
* Add view option for Task definitions in UI explorer. by yoavkatz in https://github.com/IBM/unitxt/pull/891
* Add input type checking in LoadFromDictionary by yoavkatz in https://github.com/IBM/unitxt/pull/900
* Add TokensSlice operator by elronbandel in https://github.com/IBM/unitxt/pull/902
* Make some logs critical by elronbandel in https://github.com/IBM/unitxt/pull/973
* Add LogProbInferenceEngines API and implement for OpenAI by lilacheden in https://github.com/IBM/unitxt/pull/909
* Added support for ibm-watsonx-ai inference by pawelknes in https://github.com/IBM/unitxt/pull/961
* load_dataset supports loading cards not present in local catalog by pawelknes in https://github.com/IBM/unitxt/pull/929
* Added defaults to tasks by pawelknes in https://github.com/IBM/unitxt/pull/921
* Add raw predictions and references to results by yoavkatz in https://github.com/IBM/unitxt/pull/934
* Allow ad-hoc metrics and templates (and add first version of standalone example of dataset with LLM as a judge) by eladven in https://github.com/IBM/unitxt/pull/922
* Add infer() function for end to end inference pipeline by elronbandel in https://github.com/IBM/unitxt/pull/952

Bug Fixes
* LLMaaJ implementation of MLCommons' simple-safety-tests by bnayahu in https://github.com/IBM/unitxt/pull/873
* Update gradio version on website by elronbandel in https://github.com/IBM/unitxt/pull/896
* Improve demo by elronbandel in https://github.com/IBM/unitxt/pull/898
* Fix demo and organize files by elronbandel in https://github.com/IBM/unitxt/pull/897
* Make sacrebleu robust by yoavkatz in https://github.com/IBM/unitxt/pull/892
* Fix huggingface assets to have versions and up to date readme by elronbandel in https://github.com/IBM/unitxt/pull/895
* fix(cos loader): account for slashes in cos file name by jezekra1 in https://github.com/IBM/unitxt/pull/904
* llama3 instruct and chat system prompts by oktie in https://github.com/IBM/unitxt/pull/950
* Added trust_remote_code to HF dataset query operations by yoavkatz in https://github.com/IBM/unitxt/pull/911

Documentation
* Update llm_as_judge.rst by yoavkatz in https://github.com/IBM/unitxt/pull/970
* Michal Jacovi's completed manual review of the card descriptions by dafnapension in https://github.com/IBM/unitxt/pull/883
* In card preparers, generate the tags with "singletons" rather than values paired with True by dafnapension in https://github.com/IBM/unitxt/pull/874
* Improved documentation by yoavkatz in https://github.com/IBM/unitxt/pull/886
* Update glossary.rst by yoavkatz in https://github.com/IBM/unitxt/pull/899
* Add example section to documentation by yoavkatz in https://github.com/IBM/unitxt/pull/917
* Added example of open qa using catalog by yoavkatz in https://github.com/IBM/unitxt/pull/919
* Update example intro and simplified WNLI cards by yoavkatz in https://github.com/IBM/unitxt/pull/923
* Update adding_metric.rst by yoavkatz in https://github.com/IBM/unitxt/pull/955
* RAG documentation by yoavkatz in https://github.com/IBM/unitxt/pull/928
* docs: update adding_dataset.rst by eltociear in https://github.com/IBM/unitxt/pull/927
* prepare for __description__= that is different from those embedded automatically by dafnapension in https://github.com/IBM/unitxt/pull/937
* Add simple LLM as a judge example, of using it without installation by eladven in https://github.com/IBM/unitxt/pull/968
* Add example of using LLM as a judge for summarization dataset. by eladven in https://github.com/IBM/unitxt/pull/965
* Improve operators documentation by elronbandel in https://github.com/IBM/unitxt/pull/942

New Assets
* Add numeric nlg dataset by ShirApp in https://github.com/IBM/unitxt/pull/882
* Add to_list_by_hyphen_space processor by marukaz in https://github.com/IBM/unitxt/pull/872
* Added tags and descriptions to safety cards by bnayahu in https://github.com/IBM/unitxt/pull/887
* Add Mt-Bench datasets + add operators by OfirArviv in https://github.com/IBM/unitxt/pull/870
* Touch up numeric nlg by elronbandel in https://github.com/IBM/unitxt/pull/889
* split train to train and validation sets in billsum by alonh in https://github.com/IBM/unitxt/pull/901
* modified wikitq, tab_fact taskcards by ShirApp in https://github.com/IBM/unitxt/pull/963
* Implementation of TruthfulQA by bnayahu in https://github.com/IBM/unitxt/pull/931
* Add bluebench cards by perlitz in https://github.com/IBM/unitxt/pull/918
* Add LlamaIndex faithfulness metric by arielge in https://github.com/IBM/unitxt/pull/971
* Expanded template support for safety cards by bnayahu in https://github.com/IBM/unitxt/pull/943

Testing and CI/CD
* Add end to end realistic test to fusion by elronbandel in https://github.com/IBM/unitxt/pull/940
* Moved test_examples to run the actual examples by yoavkatz in https://github.com/IBM/unitxt/pull/913
* Use uv for installing requirements in actions by elronbandel in https://github.com/IBM/unitxt/pull/960
* Add ability to print_dict to print selected fields by yoavkatz in https://github.com/IBM/unitxt/pull/947
* Get rid of pkg_resources dependency by elronbandel in https://github.com/IBM/unitxt/pull/932
* adapt filtering lambda to datasets 2.20 by dafnapension in https://github.com/IBM/unitxt/pull/930
* Increase preparation log to error. by elronbandel in https://github.com/IBM/unitxt/pull/959

New Contributors
* ShirApp made their first contribution in https://github.com/IBM/unitxt/pull/882
* oktie made their first contribution in https://github.com/IBM/unitxt/pull/950

**Full Changelog**: https://github.com/IBM/unitxt/compare/1.10.0...1.10.1
