Analyzers
Analyzers are used to pre-process, tokenize and normalize text. If you are familiar with analyzers in Lucene, Solr or Elasticsearch, the concept is the same, although the details may differ slightly. Analyzers are typically language-specific.
By default the tokenization discards all words that are shorter than three characters, but this can be configured by setting token_min_length in the analyzer parameters. For example, to discard only words of one character (when using the snowball analyzer for English), use snowball(english,token_min_length=2).
Annif supports many analyzers: simple, snowball, simplemma, voikko, spacy and estnltk.
The simple analyzer only splits text into words and turns them all into lowercase.
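The combined effect of the simple analyzer and the token_min_length parameter can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name tokenize and the regular-expression word splitting are hypothetical and are not Annif's actual implementation, which may split and validate tokens differently.

```python
import re


def tokenize(text, token_min_length=3):
    """Split text into lowercase words and drop the ones that are too short.

    Illustrative stand-in only; not Annif's own tokenizer.
    """
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if len(word) >= token_min_length]


text = "It is a new AI model"
# Default: words shorter than three characters are discarded.
print(tokenize(text))
# With token_min_length=2, only one-character words are discarded.
print(tokenize(text, token_min_length=2))
```

Lowering token_min_length can matter for corpora where short tokens such as abbreviations carry meaning.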
The snowball analyzer additionally performs stemming. It takes a language name as parameter, e.g. snowball(english) or snowball(finnish). You can use any language supported by the NLTK Snowball stemmer; see the NLTK stemmer documentation for details on supported languages.
The supported languages as of NLTK 3.4.5 are:
arabic danish dutch english finnish french german hungarian italian norwegian porter portuguese romanian russian spanish swedish
The simplemma analyzer performs simple rule-based lemmatization for many languages. It takes a language code as parameter, e.g. simplemma(en). Lemmatization gives better results than stemming in many cases, but this depends on the language and classification task.
The voikko analyzer performs lemmatization for Finnish. It takes a language code as parameter, e.g. voikko(fi). This analyzer needs to be installed separately. See Optional features and dependencies.
The spacy analyzer performs lemmatization for many languages using the spaCy NLP toolkit. See Models & Languages for the current list of supported languages.
The analyzer takes a language model name as parameter, e.g. spacy(en_core_web_sm). Optionally, lemmas can be forced to lowercase using the lowercase option, like this: spacy(en_core_web_sm,lowercase=1).
This analyzer and the language-specific models need to be installed separately. See Optional features and dependencies.
The estnltk analyzer (new in Annif 1.3) performs lemmatization for the Estonian language using the EstNLTK toolkit. It takes no parameters.
This analyzer needs to be installed separately. See Optional features and dependencies.
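The analyzer specifications shown on this page share one shape: an analyzer name, optionally followed by parenthesised parameters, where a parameter is either positional (a language name, code or model name) or a key=value option. The sketch below splits such a string into its parts. The helper parse_analyzer_spec is hypothetical and written only for illustration; Annif's own parsing may differ, and option values are left as strings here.

```python
import re

# Hypothetical pattern for illustration: a name, then optional "(...)" parameters.
SPEC_RE = re.compile(r"^(?P<name>\w+)(?:\((?P<params>.*)\))?$")


def parse_analyzer_spec(spec):
    """Return (name, positional_args, keyword_args) for an analyzer spec string."""
    match = SPEC_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"Invalid analyzer specification: {spec!r}")
    args, kwargs = [], {}
    params = match.group("params")
    if params:
        for part in params.split(","):
            part = part.strip()
            if "=" in part:
                # key=value option, e.g. token_min_length=2
                key, value = part.split("=", 1)
                kwargs[key.strip()] = value.strip()
            else:
                # positional parameter, e.g. a language name or model name
                args.append(part)
    return match.group("name"), args, kwargs


for spec in (
    "estnltk",
    "snowball(english,token_min_length=2)",
    "spacy(en_core_web_sm,lowercase=1)",
):
    print(parse_analyzer_spec(spec))
```

Keeping positional and key=value parameters apart mirrors how the examples above are written: the first parameter selects the language or model, and the options adjust the analyzer's behaviour.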