Breaking changes
- Changes the focus from Vec<Token> to Sentence (#54). pipe and sentencize return iterators over Sentence / IncompleteSentence now.
- Removes the special SENT_START token (now only used internally). Each token corresponds to at least one character in the input text now.
- Makes the fields of Token and IncompleteToken private and adds getter methods (#54). char_span and byte_span are replaced by a Span struct which keeps track of char and byte indices at the same time (#54). For example, to get the byte range, use token.span().byte().
- Spans are now relative to the input text rather than to sentence boundaries (#53, thanks @drahnr!).
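
To illustrate what tracking char and byte indices at the same time means, and what offsets relative to the input text look like, here is a minimal, self-contained sketch. The Span type and its from_byte_range constructor below are hypothetical stand-ins written for this example; they are not nlprule's actual implementation, and only the byte() getter name is taken from the notes above.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a span type (NOT nlprule's actual `Span`).
/// It stores the byte range and the char range of the same piece of text.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    byte: Range<usize>,
    char: Range<usize>,
}

impl Span {
    /// Hypothetical constructor: derives the char range from a byte range of `text`.
    /// Panics if the byte range does not fall on UTF-8 character boundaries.
    fn from_byte_range(text: &str, byte: Range<usize>) -> Span {
        let char_start = text[..byte.start].chars().count();
        let char_len = text[byte.clone()].chars().count();
        Span {
            char: char_start..char_start + char_len,
            byte,
        }
    }

    /// Byte range, relative to the start of the whole input text.
    fn byte(&self) -> &Range<usize> {
        &self.byte
    }

    /// Char range, relative to the start of the whole input text.
    fn char(&self) -> &Range<usize> {
        &self.char
    }
}

fn main() {
    // "ë" takes two bytes in UTF-8, so byte and char offsets diverge after it.
    let text = "Zoë runs. She wins.";

    // "She" is in the second sentence, but its offsets count from the start
    // of the whole text, not from the start of that sentence.
    let start = text.find("She").unwrap();
    let span = Span::from_byte_range(text, start..start + 3);

    println!("byte range: {:?}", span.byte());
    println!("char range: {:?}", span.char());
    println!("slice: {}", &text[span.byte().clone()]);
}
```

Because the offsets count from the start of the input, the byte range can be used to slice the original string directly, with no per-sentence offset bookkeeping.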
New features
- The regex backend can now be chosen from Oniguruma or fancy-regex with the features regex-onig and regex-fancy. regex-onig is the default.
- nlprule now compiles to WebAssembly. WebAssembly support is guaranteed for future versions and tested in CI.
- A new selector API to select individual rules (details documented in nlprule::rule::id). For example:

    use nlprule::{Tokenizer, Rules, rule::id::Category};
    use std::convert::TryInto;

    let mut rules = Rules::new("path/to/en_rules.bin")?;

    // disable rules named "confusion_due_do" in category "confused_words"
    rules
        .select_mut(
            &Category::new("confused_words")
                .join("confusion_due_do")
                .into(),
        )
        .for_each(|rule| rule.disable());

    // disable all grammar rules
    rules
        .select_mut(&Category::new("grammar").into())
        .for_each(|rule| rule.disable());

    // a string syntax where slashes are the separator is also supported
    rules
        .select_mut(&"confused_words/confusion_due_do".try_into()?)
        .for_each(|rule| rule.enable());
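
The last call in the example relies on a fallible conversion from a string, which is why it imports TryInto and ends in a question mark. As a rough illustration of that idiom only, the sketch below defines a hypothetical Selector type that splits a string on slashes and rejects empty segments. It is an assumption made for this example and does not reproduce the real types in nlprule::rule::id, whose parsing rules and error type may differ.

```rust
use std::convert::TryFrom;

/// Hypothetical selector (NOT the nlprule type): the slash-separated
/// segments of an identifier such as "confused_words/confusion_due_do".
#[derive(Debug, PartialEq)]
struct Selector {
    parts: Vec<String>,
}

impl<'a> TryFrom<&'a str> for Selector {
    // A plain String keeps the sketch short; a real library would likely
    // use a dedicated error type.
    type Error = String;

    fn try_from(value: &'a str) -> Result<Self, Self::Error> {
        let parts: Vec<String> = value.split('/').map(str::to_string).collect();

        // An empty string, or a leading, trailing or doubled slash,
        // produces an empty segment, which this sketch treats as invalid.
        if parts.iter().any(|part| part.is_empty()) {
            Err(format!("empty segment in selector {:?}", value))
        } else {
            Ok(Selector { parts })
        }
    }
}

fn main() {
    match Selector::try_from("confused_words/confusion_due_do") {
        Ok(selector) => println!("parsed segments: {:?}", selector.parts),
        Err(message) => println!("invalid selector: {}", message),
    }

    // A malformed string surfaces as an error instead of a panic.
    println!("{:?}", Selector::try_from("grammar//oops"));
}
```

Implementing TryFrom for string slices also makes try_into() available on them, so a caller can write the conversion inline and propagate a malformed selector as an ordinary error.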