Release 0.5.1

@bminixhofer released this 31 Mar 16:41
· 27 commits to main since this release

Breaking changes

  • Changes the focus from Vec<Token> to Sentence (#54). pipe and sentencize now return iterators over Sentence / IncompleteSentence.
  • Removes the special SENT_START token (it is now only used internally). Each token now corresponds to at least one character in the input text.
  • Makes the fields of Token and IncompleteToken private and adds getter methods (#54).
  • char_span and byte_span are replaced by a Span struct which keeps track of char and byte indices at the same time (#54). For example, to get the byte range, use token.span().byte().
  • Spans are now relative to the input text rather than to sentence boundaries (#53, thanks @drahnr!).
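
To illustrate what a span carrying both index kinds looks like when it is relative to the whole input text, here is a minimal, self-contained sketch. The Span type below is a hypothetical stand-in written for this example, not nlprule's actual definition; the field names and the from_byte_range constructor are assumptions, and only the byte() / char() accessor idea mirrors the notes above.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a span type (NOT nlprule's real struct).
/// It stores byte and char ranges side by side, both relative to the full input text.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    byte_range: Range<usize>,
    char_range: Range<usize>,
}

impl Span {
    /// Assumed constructor: derives the char range from a byte range by counting chars.
    /// The byte range must lie on UTF-8 character boundaries of `text`.
    fn from_byte_range(text: &str, byte: Range<usize>) -> Span {
        let char_start = text[..byte.start].chars().count();
        let char_len = text[byte.clone()].chars().count();
        Span {
            char_range: char_start..char_start + char_len,
            byte_range: byte,
        }
    }

    /// Byte offsets into the input text (usable for slicing a &str).
    fn byte(&self) -> &Range<usize> {
        &self.byte_range
    }

    /// Char offsets into the input text.
    fn char(&self) -> &Range<usize> {
        &self.char_range
    }
}

fn main() {
    let text = "Café au lait. Très bon.";
    // "bon" sits in the second sentence, but its offsets refer to the whole text.
    let start = text.find("bon").unwrap();
    let span = Span::from_byte_range(text, start..start + "bon".len());

    // The two ranges differ because 'é' and 'è' each take two bytes in UTF-8.
    println!("byte: {:?}, char: {:?}", span.byte(), span.char());
    // prints "byte: 21..24, char: 19..22"

    // Slicing the original text with the byte range recovers the word.
    assert_eq!(&text[span.byte().clone()], "bon");
}
```

Because the offsets are relative to the input text, the byte range can be used to slice the original string directly, with no per-sentence offset bookkeeping.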

New features

  • The regex backend can now be chosen between Oniguruma and fancy-regex via the features regex-onig and regex-fancy. regex-onig is the default.
  • nlprule now compiles to WebAssembly. WebAssembly support is guaranteed for future versions and tested in CI.
  • Adds a new selector API for selecting individual rules (details documented in nlprule::rule::id). For example:
use nlprule::{Rules, rule::id::Category};
use std::convert::TryInto;

let mut rules = Rules::new("path/to/en_rules.bin")?;

// disable rules named "confusion_due_do" in category "confused_words"
rules
   .select_mut(
       &Category::new("confused_words")
           .join("confusion_due_do")
           .into(),
   )
   .for_each(|rule| rule.disable());

// disable all grammar rules
rules
   .select_mut(&Category::new("grammar").into())
   .for_each(|rule| rule.disable());

// a string syntax where slashes are the separator is also supported
rules
   .select_mut(&"confused_words/confusion_due_do".try_into()?)
   .for_each(|rule| rule.enable());