Unitxt 1.7.6

@elronbandel elronbandel released this 08 Apr 17:49
b76022c

What's Changed

The most significant change in this release is the addition of the \N (backslash, capital N) notation to formats. Placing \N in a format marks a spot where you want exactly one newline, collapsing any newlines that precede it.

A more detailed explanation, if you want to go deeper:

The Capital New Line Notation (\N) is designed to control newline behavior in a string.
It consolidates a run of newline characters (\n) that ends in \N into at most one newline,
with the result depending on whether any text precedes the run. There are two cases:
1. If there is text (referred to as a prefix) followed by any number of \n characters and then one or
more \N, the entire sequence is replaced with a single \n. This collapses multiple
newlines and notation characters into a single newline when there is preceding text.
2. If the string starts with \n characters followed by \N without any text before this sequence, or if
\N is at the very beginning of the string, the sequence is removed entirely. This case
applies when the notation should not introduce any newlines because there is no preceding text.
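
The two cases above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not necessarily the code unitxt uses internally; the function name `apply_capital_new_line` is made up for this example, and \N is treated as the two literal characters backslash and capital N.

```python
import re


def apply_capital_new_line(text: str) -> str:
    """Hypothetical helper illustrating the two \\N rules (not unitxt's API)."""
    # Case 2: a run of "\n" and "\N" at the very start of the string that
    # ends in "\N" has no text before it, so it is removed entirely.
    text = re.sub(r"^(?:\n|\\N)*\\N", "", text)
    # Case 1: any remaining run of "\n" and "\N" that ends in "\N" follows
    # some text, so it is replaced by exactly one newline.
    text = re.sub(r"(?:\n|\\N)*\\N", "\n", text)
    return text


with_prefix = apply_capital_new_line("Hello\n\n\\Nworld")  # text before the run
without_prefix = apply_capital_new_line("\n\n\\Nworld")    # nothing before the run
print(repr(with_prefix), repr(without_prefix))
```

In this sketch, newlines that come after a \N are left untouched; only the newlines leading up to it are collapsed.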

This gives us two things:
First, we can define system formats that have no unnecessary newlines when the instruction or the system prompt is missing.
Second, we can ignore any newlines created by the template, ensuring that the number of newlines is set by the format alone.

For example, if we define the system format in the following way:

from unitxt.formats import SystemFormat

format = SystemFormat(model_input_format="{system_prompt}\n{instruction}\n|user|\n{source}\n|assistant|\n{target_prefix}")

We face two issues:

  1. If the system prompt or the instruction is empty, we get extra newlines for no reason.
  2. If the source ends with a newline (mostly due to the template structure), we get an unnecessary empty line before the "|assistant|".

Both problems are solved with the \N notation:

from unitxt.formats import SystemFormat

format = SystemFormat(model_input_format="{system_prompt}\\N{instruction}\\N|user|\n{source}\\N|assistant|\n{target_prefix}")
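
To see the difference concretely, the sketch below renders both format strings with plain `str.format` and an empty system prompt and instruction. It is a simulation under stated assumptions: `apply_capital_new_line` is a hypothetical stand-in approximating the \N rules described earlier, and `SystemFormat` itself is not called, so the real rendering pipeline may differ in details.

```python
import re


def apply_capital_new_line(text: str) -> str:
    # Hypothetical stand-in for the \N rules; not unitxt's own function.
    text = re.sub(r"^(?:\n|\\N)*\\N", "", text)   # leading run: drop it
    return re.sub(r"(?:\n|\\N)*\\N", "\n", text)  # other runs: one newline


OLD = "{system_prompt}\n{instruction}\n|user|\n{source}\n|assistant|\n{target_prefix}"
NEW = "{system_prompt}\\N{instruction}\\N|user|\n{source}\\N|assistant|\n{target_prefix}"

# Example values chosen for illustration: empty system prompt and
# instruction, and a source that already ends with a newline.
fields = dict(system_prompt="", instruction="", source="What is 2+2?\n", target_prefix="")

old_prompt = OLD.format(**fields)
new_prompt = apply_capital_new_line(NEW.format(**fields))

print(repr(old_prompt))  # starts with two newlines, blank line before |assistant|
print(repr(new_prompt))  # no leading newlines, no blank line
```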

Breaking changes

  • Fix typo in MultipleChoiceTemplate field choices_seperator -> choices_separator
  • Deprecation of the use_query option in all operators. For now it only raises a warning, but it will be removed in the next major release. The new default behavior is equivalent to use_query=True.

All Changes

Bug Fixes:

Assets Fixes:

New Features:

  • Add notion of \N to formats, to fix format new line clashes by @elronbandel in #751
  • Ability to dynamically change InstanceMetric inputs + grammar metrics by @arielge in #736
  • Add DeprecatedField for a more informative procedure for deprecating fields of artifacts by @dafnapension in #741

New Assets:

  • Add rerank recall metric to unitxt by @jlqibm in #662
  • Add many selection and human preference tasks and datasets by @elronbandel in #746
  • Adding Detector metric for running any classifier from huggingface as a metric by @mnagired in #745
  • Add operators: RegexSplit, TokensSplit, Chunk by @elronbandel in #749
  • Add bert score large and base versions by @assaftibm in #748

Enhancements:

New Contributors

Full Changelog: 1.7.4...1.7.6