Release 1.12.3 · IBM/unitxt

Main changes

New option to use multiple templates and/or num_demos values in a single dataset recipe. For each instance, Unitxt randomly samples a template and a number of demos from the provided lists.
See the example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_templates_num_demos.py
A warning is now generated when a metric generates a score with the same name as a score from another metric and overwrites it.
See more details on how to deal with conflicting metric names in https://www.unitxt.ai/en/latest/docs/adding_metric.html#metric-outputs-with-multiple-metrics
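
The per-instance sampling of templates and num_demos described above can be pictured with a small, library-independent sketch. It is not Unitxt's implementation: the function name, the placeholder template names, and the fixed seed are all assumptions made for illustration. For the actual recipe arguments, see the linked example.

```python
import random

# Hypothetical candidate lists; these names are placeholders, not catalog entries.
TEMPLATES = ["template_a", "template_b"]
NUM_DEMOS = [0, 1, 3]


def assign_template_and_demos(instances, templates, num_demos_options, seed=0):
    """Attach a randomly chosen template and demo count to each instance."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    assigned = []
    for instance in instances:
        assigned.append(
            {
                **instance,
                "template": rng.choice(templates),
                "num_demos": rng.choice(num_demos_options),
            }
        )
    return assigned


instances = [{"id": i} for i in range(5)]
sampled = assign_template_and_demos(instances, TEMPLATES, NUM_DEMOS)
for row in sampled:
    print(row)
```

Because each instance draws independently, a single run covers several template and demo-count combinations, which is what makes it possible to compare them within one dataset.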

Change the RAG metrics naming convention (e.g. "metrics.rag.mrr" -> "metrics.rag.context_correctness.mrr"). This is a non-backward-compatible catalog change. By @assaftibm in #1104
Update summarization task and templates to support multiple reference summaries by @yoavkatz in #1126
Fix belebele due to new convention by @elronbandel in #1145

Add DeepSeek-Coder format and system prompt by @oktie in #1105
Add a metric to calculate the ratio of references included in the prediction by @marukaz in #1091
Add RAG bge metrics by @assaftibm

Add option to run multiple templates and/or num_demos values in a single dataset recipe. It is now possible to pass a list of templates or a list of num_demos values; for each instance, Unitxt assigns a random choice from each list. By @elronbandel in #1110
A warning is now generated when a metric generates a score with the same name as a score from another metric and overwrites it, by @dafnapension in #1124
The MetricPipeline field postpreprocess_steps has been renamed to postprocess_steps. The old field (postpreprocess_steps) still exists for backward compatibility but is deprecated. By @dafnapension in #1117
Decrease runtime of demo examples
Add tests for RAG metrics by @matanor
Add dedicated Unitxt warning and error classes that link to online documentation by @yoavkatz in
The code now uses a central controllable deepcopy function by @elronbandel in #1120
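
To illustrate the score-name conflict that the new warning reports, here is a minimal sketch using only the standard library. It is not Unitxt's code: `merge_scores` and `prefixed` are hypothetical helpers, and prefixing is just one assumed way to keep names distinct. The documentation linked under Main changes describes the supported approach.

```python
import warnings


def merge_scores(existing, new_scores, metric_name):
    """Merge new_scores into a copy of existing, warning on name collisions."""
    merged = dict(existing)
    for score_name, value in new_scores.items():
        if score_name in merged:
            warnings.warn(
                f"Metric '{metric_name}' overwrites existing score '{score_name}'",
                UserWarning,
                stacklevel=2,
            )
        merged[score_name] = value
    return merged


def prefixed(scores, prefix):
    """Give each score a metric-specific name so that nothing collides."""
    return {f"{prefix}_{name}": value for name, value in scores.items()}


first = {"accuracy": 0.8}
second = {"accuracy": 0.6}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Same score name from two metrics: the later value wins and a warning fires.
    overwritten = merge_scores(first, second, "metric_b")
    # Distinct names: both values survive and no warning is raised.
    safe = merge_scores(prefixed(first, "a"), prefixed(second, "b"), "metric_b")

print(overwritten, safe, len(caught))
```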

Create a dedicated nltk mixin for downloading all versions of punkt that are needed by the metrics code. By @elronbandel in #1151
For bulk instance metrics, replace the mean function with nanmean to support aggregation when some scores are NaN. By @elronbandel in #1150
Fix helm test by @elronbandel in #1109
Fix bug with RAG metrics: Fix use of minilm model by @assaftibm in #1115
Fix data classification of WML model to include 'public' classification by @yoavkatz in #1118
Fix WMLInferenceEngine by @pawelknes in #1122
Fix belebele HF path due to new convention by @elronbandel in #1145
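
The switch from mean to nanmean for bulk instance metrics matters because a single NaN score turns an ordinary mean into NaN. The sketch below shows the difference with a hand-written helper. It is an illustration only: the helper is hypothetical and mimics the NaN-skipping behaviour of NumPy's `numpy.nanmean` rather than reproducing Unitxt's aggregation code.

```python
import math


def nanmean(values):
    """Mean of the non-NaN values; NaN if there are none."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


scores = [1.0, float("nan"), 0.5]
plain_mean = sum(scores) / len(scores)  # NaN propagates through the sum
robust_mean = nanmean(scores)           # averages only 1.0 and 0.5
print(plain_mean, robust_mean)
```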