1.10.1
Main Changes
- Continued major improvements to the documentation, including a new code examples section with standalone Python code that shows how to perform evaluation, add new datasets, compare formats, use LLMs as judges, and more. Cards for datasets from Hugging Face now have detailed descriptions. New documentation covers RAG tasks and metrics.
- load_dataset can now load cards defined in a Python file (and not only in the catalog). See example.
- The evaluation results returned from evaluate now include two fields, predictions and processed_predictions. See example.
- Task fields can have defaults, so if they are not specified in the card, they get a default value. For example, multi-class classification has text as the default text_type. See example.
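The default-value behavior described in the last item can be pictured with a small standalone sketch. This is not unitxt's implementation: the TaskSketch class, its fill method, and the field names other than text and text_type are hypothetical and exist only to illustrate how a default might be applied when a card omits a field and overridden when the card supplies it.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class TaskSketch:
    """Hypothetical stand-in for a task definition (not unitxt's Task class)."""

    input_fields: Tuple[str, ...]
    defaults: Dict[str, Any] = field(default_factory=dict)

    def fill(self, instance: Dict[str, Any]) -> Dict[str, Any]:
        # Start from the defaults, then let values supplied by the card win.
        merged = {**self.defaults, **instance}
        # Any field that has neither a value nor a default is an error.
        missing = [name for name in self.input_fields if name not in merged]
        if missing:
            raise ValueError(f"missing fields with no default: {missing}")
        return merged


# Assumed setup: text_type defaults to "text", as in the classification example.
task = TaskSketch(
    input_fields=("text", "text_type", "classes"),
    defaults={"text_type": "text"},
)

# The card does not specify text_type, so the default is used.
filled = task.fill({"text": "great movie", "classes": ["positive", "negative"]})

# The card specifies text_type, so the default is overridden.
overridden = task.fill(
    {"text": "great movie", "text_type": "review", "classes": ["positive", "negative"]}
)
```

In this sketch, filled["text_type"] is "text" while overridden["text_type"] is "review"; a field with no value and no default raises an error instead of being silently skipped.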
Non backward compatible changes
You need to recreate any cards/metrics you added by running the prepare//.py file. You can create all cards simply by running python utils/prepare_all_artifacts.py. This will avoid the type error.
The AddFields operator was renamed to Set, and the CopyFields operator was renamed to Copy. Previous code should continue to work, but all existing code in the unitxt and fm-eval repos now uses the new names.
- Change Artifact.type to Artifact.__type__ by @elronbandel in #933
- change CopyFields operators name to Copy by @duckling69 in #876
- Rename AddFields to Set, a name that represents its role better and more concisely by @elronbandel in #903
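One common way to keep old names working after a rename like AddFields to Set is to retain the old name as a thin subclass of the new one. The sketch below illustrates that general pattern only. The classes here are simplified, hypothetical stand-ins, and whether unitxt uses this mechanism or emits a deprecation warning is an assumption, not something these notes state.

```python
import warnings


class Set:
    """Hypothetical operator that sets fixed field values on an instance."""

    def __init__(self, fields):
        self.fields = dict(fields)

    def process(self, instance):
        # Return a new dict so the caller's instance is left unchanged.
        return {**instance, **self.fields}


class AddFields(Set):
    """Old name kept as a subclass so existing code keeps working (assumed pattern)."""

    def __init__(self, fields):
        warnings.warn(
            "AddFields was renamed to Set", DeprecationWarning, stacklevel=2
        )
        super().__init__(fields)


instance = {"text": "hello"}

# Code written against the old name still runs, but triggers a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = AddFields(fields={"label": "unknown"})

# Code written against the new name behaves identically.
new_style = Set(fields={"label": "unknown"})

old_result = old_style.process(instance)
new_result = new_style.process(instance)
```

Both calls produce the same result, so migrating is a matter of swapping the class name; the warning gives users of the old name a nudge without breaking them.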
New Features
- Allow eager execution by @elronbandel in #888
- Add view option for Task definitions in UI explorer. by @yoavkatz in #891
- Add input type checking in LoadFromDictionary by @yoavkatz in #900
- Add TokensSlice operator by @elronbandel in #902
- Make some logs critical by @elronbandel in #973
- Add LogProbInferenceEngines API and implement for OpenAI by @lilacheden in #909
- Added support for ibm-watsonx-ai inference by @pawelknes in #961
- load_dataset supports loading cards not present in local catalog by @pawelknes in #929
- Added defaults to tasks by @pawelknes in #921
- Add raw predictions and references to results by @yoavkatz in #934
- Allow ad hoc metrics and templates (and add first version of standalone example of dataset with LLM as a judge) by @eladven in #922
- Add infer() function for end to end inference pipeline by @elronbandel in #952
Bug Fixes
- LLMaaJ implementation of MLCommons' simple-safety-tests by @bnayahu in #873
- Update gradio version on website by @elronbandel in #896
- Improve demo by @elronbandel in #898
- Fix demo and organize files by @elronbandel in #897
- Make sacrebleu robust by @yoavkatz in #892
- Fix huggingface assets to have versions and up to date readme by @elronbandel in #895
- fix(cos loader): account for slashes in cos file name by @jezekra1 in #904
- llama3 instruct and chat system prompts by @oktie in #950
- Added trust_remote_code to HF dataset query operations by @yoavkatz in #911
Documentation
- Update llm_as_judge.rst by @yoavkatz in #970
- Michal Jacovi's completed manual review of the card descriptions by @dafnapension in #883
- In card preparers, generate the tags with "singletons" rather than values paired with True by @dafnapension in #874
- Improved documentation by @yoavkatz in #886
- Update glossary.rst by @yoavkatz in #899
- Add example section to documentation by @yoavkatz in #917
- Added example of open qa using catalog by @yoavkatz in #919
- Update example intro and simplified WNLI cards by @yoavkatz in #923
- Update adding_metric.rst by @yoavkatz in #955
- RAG documentation by @yoavkatz in #928
- docs: update adding_dataset.rst by @eltociear in #927
- Prepare for description= that is different from those embedded automatically by @dafnapension in #937
- Add simple LLM as a judge example, of using it without installation by @eladven in #968
- Add example of using LLM as a judge for summarization dataset. by @eladven in #965
- Improve operators documentation by @elronbandel in #942
New Assets
- Add numeric nlg dataset by @ShirApp in #882
- Add to_list_by_hyphen_space processor by @marukaz in #872
- Added tags and descriptions to safety cards by @bnayahu in #887
- Add Mt-Bench datasets + add operators by @OfirArviv in #870
- Touch up numeric nlg by @elronbandel in #889
- split train to train and validation sets in billsum by @alonh in #901
- modified wikitq, tab_fact taskcards by @ShirApp in #963
- Implementation of TruthfulQA by @bnayahu in #931
- Add bluebench cards by @perlitz in #918
- Add LlamaIndex faithfulness metric by @arielge in #971
- Expanded template support for safety cards by @bnayahu in #943
Testing and CI/CD
- Add end to end realistic test to fusion by @elronbandel in #940
- Moved test_examples to run the actual examples by @yoavkatz in #913
- Use uv for installing requirements in actions by @elronbandel in #960
- Add ability to print_dict to print selected fields by @yoavkatz in #947
- Get rid of pkg_resources dependency by @elronbandel in #932
- adapt filtering lambda to datasets 2.20 by @dafnapension in #930
- Increase preparation log to error. by @elronbandel in #959
Full Changelog: 1.10.0...1.10.1