Unitxt 1.12.2
Main changes
- Task "input"/"output" fields renamed to "input_fields" and "reference_fields" to better reflect their meaning, and the type of each field is now defined by Python class names rather than strings (str vs "str"). See an example of the new syntax here: https://www.unitxt.ai/en/latest/docs/adding_task.html (old syntax still allowed)
- Ability to create an ensemble of judges. See example in https://www.unitxt.ai/en/latest/docs/examples.html#evaluate-using-ensemble-of-llm-as-a-judge-metrics
- Optimized Rouge and Meteor metrics to run faster; they now report confidence intervals by default. This causes very small variances in scores (well within the confidence interval)
- Added ability to select demonstrations that depend on the specific instance (and not only random). See example in https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_demo_selections.py . This change causes some changes in selection of random demos due to seed changes, but should not have any aggregated effect beyond random fluctuations.
- For LLM as Judges, the input sent to the judge is now displayed in the score field called 'judge_raw_input'
- Support for arena hard benchmark. See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py
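
To illustrate the difference between declaring field types as Python classes and as strings, here is a minimal sketch in plain Python. It does not use the Unitxt API: `task_spec` and `check_instance` are hypothetical names made up for this illustration, and only the "input_fields" / "reference_fields" key names come from the notes above. See the linked documentation for the actual task syntax.

```python
# Hypothetical illustration only; this is NOT the Unitxt API.
# Field types are real Python classes (str, float), not the strings "str" or "float".
task_spec = {
    "input_fields": {"question": str, "context": str},
    "reference_fields": {"answer": str, "score": float},
}


def check_instance(instance, spec):
    """Return (field, problem) pairs for fields that are missing or mistyped."""
    problems = []
    for group in ("input_fields", "reference_fields"):
        for name, expected in spec[group].items():
            if name not in instance:
                problems.append((name, "missing"))
            elif not isinstance(instance[name], expected):
                # Because the type is a class, isinstance works directly.
                problems.append((name, f"expected {expected.__name__}"))
    return problems


good = {"question": "Who?", "context": "A short text", "answer": "Ada", "score": 1.0}
bad = {"question": "Who?", "context": "A short text", "answer": "Ada", "score": "1.0"}

print(check_instance(good, task_spec))
print(check_instance(bad, task_spec))
```

With class-based types, a value can be checked with `isinstance` directly, without first parsing a type name out of a string.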
Non backward compatible changes
- Changed template method names to use "input_fields" and "reference_fields" (affects only people who wrote custom template code) by @yoavkatz in #1030
- Refactor Rouge and Meteor to InstanceMetric for faster score computation - this causes very small variances in scores (well within the confidence interval) by @yoavkatz in #1011
- Ability to create demo samplers based on instance (this causes changes in random selection of demos in normal mode) by @yoavkatz in #1034
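
For readers who want to judge whether a score change falls "within the confidence interval", the following sketch computes a confidence interval from per-instance scores with a generic percentile bootstrap. This is an illustration under assumptions: it is not necessarily the method Unitxt uses, `bootstrap_ci` is a hypothetical helper, and the scores are made-up numbers.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-instance scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the instance scores with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return low, high


# Made-up per-instance scores, for illustration only.
instance_scores = [0.42, 0.55, 0.61, 0.38, 0.70, 0.49, 0.58, 0.66, 0.45, 0.52]
mean_score = statistics.fmean(instance_scores)
low, high = bootstrap_ci(instance_scores)
print(f"mean={mean_score:.3f} ci=({low:.3f}, {high:.3f})")
```

If a score from an earlier version still falls between `low` and `high`, the difference is the kind of small variance described above.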
Changes in Catalog
- Safety and regard metrics became instance metrics and are now named SafetyMetric and RegardMetric by @dafnapension in #1004
- Remove financebench card since it was removed from HF by @elronbandel in #1016
- add validation to tldr, remove shuffle from billsum by @alonh in #1038
- Fix typo in japanese_llama system prompt (issue #964) by @bnayahu in #1056
- numeric nlg dataset template changes by @ShirApp in #1041
Additions to catalog
- Arena hard elad2 by @eladven and @OfirArviv in #1026
- Add flores101 by @perlitz in #1053
- Add metric "metrics.rag.retrieval_at_k" to catalog by @matanor in #1074
- Add Finqa dataset by @ShirApp in #962
- Allow rag context_id fields to be List[str] and not only List[int] by @perlitz in #1036
- Rag end to end task support (in progress) by @benjaminsznajder in #1044, #1080
New Features
- Rename task fields "input"/"output" to "input_fields" and "reference_fields" by @luisaadanttas in #994
- Support for ensemble metrics by @eladven in #1047
- Additional inference parameters for openai and genai and simplified InferenceEngine API param passing by @pawelknes in #1019, #1024
- Real types in tasks and metrics by @elronbandel in #1045
- Ability to create demo samplers based on instance by @yoavkatz in #1034
- add judge input to the LLM as Judge metric scores by @OfirArviv in #1064
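
The idea behind instance-based demo samplers can be sketched as follows: instead of drawing demonstrations at random, rank the demo pool by similarity to the instance being evaluated. This is a plain-Python sketch, not the Unitxt sampler API; the function names, the word-overlap similarity, and the data are all assumptions made for illustration. The linked example script shows the actual usage.

```python
import random


def token_overlap(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_random_demos(pool, k, seed=0):
    """Instance-independent selection: the same seed gives the same demos."""
    return random.Random(seed).sample(pool, k)


def select_similar_demos(pool, instance, k):
    """Instance-dependent selection: pick the k demos most similar to the instance."""
    ranked = sorted(
        pool,
        key=lambda demo: token_overlap(demo["question"], instance["question"]),
        reverse=True,
    )
    return ranked[:k]


# Made-up demo pool and instance, for illustration only.
pool = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many legs does a spider have?", "answer": "8"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
    {"question": "Who wrote Hamlet?", "answer": "Shakespeare"},
]
instance = {"question": "What is the capital of Italy?"}

demos = select_similar_demos(pool, instance, k=2)
print([demo["answer"] for demo in demos])
```

Here the two "capital of" questions rank highest for the instance, whereas random selection ignores the instance entirely.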
Bug Fixes
- Solve problem with stripping format in LLM as a judge code by @eladven in #1005
- Added seed to LLM as judges for consistent results by @yoavkatz in #1029
- Fixed issues with fresh install by @yoavkatz in #1037
- WML Inference Engine fix by @pawelknes in #1013
- replace type and type in type error message by @perlitz in #1035
- FinQA - filter problematic examples by @ShirApp in #1039
- demo's target prefix is now taken from demo instance by @dafnapension in #1031
- Make sure preparation times printed fully and nicely by @elronbandel in #1046
- Added prediction type to llm as judge to avoid warning by @yoavkatz in #1072
- Fixed confidence interval inconsistency when some metrics compute ci and some do not by @dafnapension in #1065
- Fix bug in data classes and add support for field overriding in fields containing types or functions by @elronbandel in #1027
- Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without defining FMEVAL_COS_URL by @eladven in #1021
- Added check of type of format and system prompt to LLM as judge by @yoavkatz in #1068
- Allow assigning None in overwrites when fetching artifacts with modifications by @dafnapension in #1062
- fix - building test is not working. Updated Kaggle version. by @benjaminsznajder in #1055
Documentation changes
- Update error message and documentation on unitxt local and HF version conflict by @yoavkatz in #995
- Update llm_as_judge.rst by @yoavkatz in #1085
- Update introduction.rst add the word "a" before "variety" by @welisheva22 in #1015
- Example improvements by @yoavkatz in #1022
- Add a guide for using unitxt with lm-evaluation-harness by @elronbandel in #1020
- Fix some docs titles and links by @elronbandel in #1023
- Add example of meta evaluation of llm as judge by @yoavkatz in #1025
- Update introduction.rst - copy edits (grammar, consistency, clarity) by @welisheva22 in #1063
- Added example for selection of demos by @yoavkatz in #1052
New Contributors
We want to thank the new contributors for their first contributions!
- @welisheva22 made their first contribution in #1015
- @luisaadanttas made their first contribution in #994
- @benjaminsznajder made their first contribution in #1055
- @hanansinger made their first contribution in #1057