
Releases: NatLibFi/Annif

Annif 0.56

01 Feb 12:18
v0.56.0
d439af1

This release introduces a new spaCy analyzer and takes care of many maintenance tasks. CLI usage is improved by shorter startup times for some commands, the Docker images are now easier to customize, the eval command gains new options, and minor bugs are fixed.

The spaCy analyzer enables support for some new languages and can improve subject suggestion results. The spaCy analyzer and the language-specific models need to be installed separately. The Docker image distributed via quay.io includes the spaCy analyzer and the English language model, but no other languages.
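A rough sketch of how this could look in practice; the extras name, model name and analyzer syntax below are assumptions to be checked against the installation and analyzer documentation:

    # install the optional spaCy dependency and an English model (names assumed)
    pip install annif[spacy]
    python -m spacy download en_core_web_sm

    # projects.cfg: use the spaCy analyzer in an English project (placeholders)
    [my-project-en]
    name=My English project
    language=en
    backend=tfidf
    vocab=my-vocab
    analyzer=spacy(en_core_web_sm)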

The maintenance tasks include upgrading many dependencies, notably Omikuji to v0.4. The Omikuji upgrade brings faster training and prediction as well as reduced memory usage, but Annif projects using the omikuji backend need to be retrained. Projects using other backends should not require retraining, although warnings may be shown in some cases.

Support for Python 3.6 has been dropped, as necessitated by the dependency upgrades.

This release also removes the Maui and vw_multi (Vowpal Wabbit) backends.

New features:
#374/#527/#563 spaCy analyzer

Improvements:
#514/#544 Optimize startup time using local & lazy imports
#548 Allow selecting installed optional dependencies in Docker build
#545/#558 Select metrics for eval command using an option
#546/#557 Output eval metrics as a JSON file compatible with DVC

Bug fixes:
#552/#554 LMDB can overflow (credit: @mo-fu)
#562 Add missing import of annif.eval in MLLM backend

Maintenance:
#549 Update dependencies for v0.56
#550 Drop Python 3.6 support
#541 Remove Maui and Vowpal Wabbit multi backends
#551 Remove swagger-tester dependency
#542/#555 Add CITATION.cff file
#553 Update Scrutinizer config
#561 Set a 10 minute timeout for GitHub Actions CI jobs
#565 Avoid coverage 6.3 as it causes some tests to hang

Annif 0.55

08 Nov 09:40
v0.55.0
e17754b

This release includes a new language filtering feature. This input transform filters out sentences of the input text whose language differs from the project language. Language detection is performed with Compact Language Detector v3 via pycld3. pycld3 is an optional dependency of Annif; see the installation page.
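A minimal sketch of enabling the filter, assuming the optional dependency is installed via an extras group and the transform is named filter_lang (both names are assumptions to be checked against the documentation):

    # install the optional language detection dependency (extras name assumed)
    pip install annif[pycld3]

    # projects.cfg: drop sentences that are not in the project language (placeholders)
    [my-project-fi]
    language=fi
    transform=filter_lang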

Minor bug fixes and dependency updates are also included.

The Maui and vw_multi (Vowpal Wabbit) backends have been marked as deprecated in this release and will be removed in the next release, 0.56. The removal is motivated by keeping the codebase more compact and thus easier to maintain. The MLLM and nn_ensemble backends offer functionality similar to Maui and vw_multi.

Note that the notes for the previous release (Annif 0.54) initially failed to mention the added support for the input-transform feature.

New features:
#464/#507 Language filtering in input text

Improvements:
#536 Allow rdflib version 6.*

Bug fixes:
#533/#534 Adjust flask and click versions to avoid dependency mismatches

Maintenance:
#530 Add deprecation warning to Maui & vw_multi train commands
#492/#529 Update Docker base image to Debian Bullseye to upgrade Voikko library

Annif 0.54.1

06 Sep 13:37
a63f7d2

This is a patch release that fixes bugs found after the 0.54.0 release. In particular, installation using pip was not working correctly due to a missing dependency on the dateutil package.

Bugs fixed:
#523 Make Drone builds start on all git tag events
#524 Add MLLM classifier sanity check
#525 Much faster updating of existing large vocabulary
#528 Declare dateutil dependency

Annif 0.54

24 Aug 07:13
8ec6c75

This release adds a new --jobs parameter for the annif train command, which allows easy control of the number of threads/CPUs used when training the MLLM, fasttext and Omikuji backends. Many other improvements are included that speed up the MLLM backend, especially with a large vocabulary. A few minor bugs have also been fixed.
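For example, training with a limited number of parallel jobs could look roughly like this (the project id and corpus path are placeholders):

    # train an Omikuji project using 4 threads/CPUs
    annif train my-omikuji-project /path/to/training-corpus/ --jobs 4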

Edit: This release also introduces support for adding new text-input transformation operations to Annif. Previously the input-limiting feature was implemented as a backend mechanism (#446, #452) that was set up in the project configuration, e.g. with the setting input_limit=5000; it is now implemented as a more general input-text transform and can be set up in the project configuration with transform=limit(5000).
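In projects.cfg terms, the change looks roughly like this (the section and backend names are placeholders):

    # 0.53 and earlier: input limiting as a backend parameter
    [my-project]
    backend=omikuji
    input_limit=5000

    # 0.54: input limiting as a general transform
    [my-project]
    backend=omikuji
    transform=limit(5000)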

New features:
#512 Support jobs parameter in train command
Edit: #496 Support for adding input-transformation operations

Improvements:
#500 Implement custom MeanLayer in nn_ensemble
#511/#483 Process training docs in parallel in MLLM backend
#513/#519 Keep serialized dump of SKOS graph to save parsing time
#518 Use least frequent token as key in TokenSetIndex used by MLLM
#520 Optimize limit_mask creation

Bug fixes:
#510/#502 Use set as container of uris instead of list in DocumentFile
#515/#453 Allow NN ensemble to be used for parallel eval
#517 Skip unimportant subjects in _vector_to_list_suggestion
#522/#521 Allow private projects to be accessed from CLI

Annif 0.53.2

10 Aug 14:27
8b219da

This patch release includes the following changes:

  • #506 Fix NN ensemble training and learning on one-document corpus
  • #509 Warn instead of error in case of multiple subjects per doc in SVC training
  • #503 Fix read-the-docs documentation build error due to package conflict

Annif 0.53.1

01 Jul 07:46
c7297ff

This patch release fixes a bug which prevented training the SVC backend on fulltext corpus.

Annif 0.53

21 Jun 18:55
271da87

This release adds two new backends, YAKE and SVC. The YAKE backend is a wrapper around the YAKE library, which performs unsupervised lexical keyword extraction, so no training data is needed. See the YAKE wiki page for more information. In future Annif releases, YAKE support could be extended so that it can also be used to suggest new terms for a vocabulary (keywords that are not found in the vocabulary).

The SVC backend implements Linear Support Vector Classification. It is well suited for multiclass (but not multilabel) classification, for example classifying documents with the Dewey Decimal Classification or the 20 Newsgroups categories. It requires relatively little training data and is suitable for classifications of up to around 10,000 classes. See the SVC wiki page for more information.
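A minimal configuration sketch for the two backends; project ids, vocabularies and analyzers are placeholders, and the wiki pages mentioned above describe the actual settings:

    # projects.cfg: YAKE needs only a vocabulary, no training
    [my-yake-en]
    name=YAKE English
    language=en
    backend=yake
    vocab=my-vocab
    analyzer=snowball(english)

    # projects.cfg: SVC is trained on a multiclass corpus
    [my-svc-ddc]
    name=SVC Dewey
    language=en
    backend=svc
    vocab=ddc
    analyzer=snowball(english)

    # train only the SVC project; the YAKE project can be used directly with suggest
    annif train my-svc-ddc /path/to/training-corpus/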

This release also upgrades many dependencies, which enables all Annif backends to run on Python 3.9 (previously the nn_ensemble backend was available only on Python 3.6-3.8). The Docker image now uses Python 3.8 instead of 3.7.

Note that nn_ensemble models are not compatible across Python versions: for example, a model trained on Python 3.7 can only be used on Python 3.7. Training nn_ensemble models shows a CustomMaskWarning, but it is harmless (caused by a TensorFlow bug) and can be ignored.

Due to the update of scikit-learn, using TFIDF, MLLM or Omikuji models trained on older Annif versions will show warnings about the TfidfVectorizer. To the best of our knowledge these are harmless and can be ignored; retrain the models to get rid of the warnings.

This release also includes many minor improvements and bug fixes.

New features:
#486 New SVC (support vector classification) backend using scikit-learn
#439/#461 YAKE backend
#490/#494 Make --version option show Annif version

Improvements:
#488 Add support for ngram setting in omikuji backend

Maintenance:
#499 Update dependencies v0.53
#487 Upgrade scikit-learn to 0.24.2
#498 Update Dockerfile

Bug fixes:
#484/#495 Show error when training MLLM on empty corpus
#489 Add Codecov Action to GH workflow for uploading reports
#491 Raise NotSupportedException for attempt to train YAKE
#497 Remove execute permissions of some files

Annif 0.52

20 Apr 06:57
0436be2

This release includes a new MLLM backend, a Python implementation of the Maui-like Lexical Matching algorithm. It was inspired by the Maui algorithm (by Alyona Medelyan) but is not a direct reimplementation. It is meant for long full-text documents and, like Maui, needs to be trained with a relatively small number (hundreds or thousands) of manually indexed documents so that the algorithm can choose the mix of heuristics that achieves the best results on a particular document collection. See the MLLM Wiki page for more information.
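A minimal sketch of setting up and training an MLLM project; the project id, vocabulary, analyzer and corpus path are placeholders:

    # projects.cfg: a lexical MLLM project
    [my-mllm-en]
    name=MLLM English
    language=en
    backend=mllm
    vocab=my-vocab
    analyzer=snowball(english)

    # train on manually indexed full-text documents
    annif train my-mllm-en /path/to/indexed-docs/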

New features include the possibility to configure two project parameters: the minimum token length and the learning rate of the NN ensemble backend.
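For the token length, the setting looks roughly like this (the value and its placement in the project section are illustrative assumptions; the learning rate of the nn_ensemble backend is configured analogously, see the backend documentation for the parameter name):

    # projects.cfg: only count tokens of at least 3 characters (placeholders)
    [my-project]
    backend=omikuji
    token_min_length=3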

The STWFSA backend has been updated to use a newer version of the stwfsapy library. Old STWFSA models are not compatible with the new version, so any STWFSA projects must be retrained. The release also includes several minor improvements and bug fixes.

New features:
#462 New lexical backend MLLM
#456/#468 Allow configuration of token min length (credit: mo-fu)
#475 Allow configuration of nn ensemble learning rate (credit: mo-fu)

Improvements:
#478/#479 Update stwfsa to 0.2.* (credit: mo-fu)
#472 Cleanup suggestion tests
#480 Optimize check for deprecated subject IDs using a set

Maintenance:
#474 Use GitHub Actions as CI service

Bug fixes:
#470/#471 Make sure suggestion scores are in the range 0.0-1.0
#477 Optimize the optimize command
#481 Backwards compatibility fix for the token_min_length setting
#482 MLLM fix: don't include use_hidden_labels in hyperopt, it won't have any effect

Annif 0.51

09 Feb 08:02

This release includes a new STWFSA backend, which is a wrapper around STWFSAPY, a lexical algorithm based on finite state automata. It achieves the best results with short texts, i.e. titles and author keywords, and is best suited for English-language data.
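A minimal configuration sketch; the backend id "stwfsa" and the other names are assumptions to be checked against the backend wiki page:

    # projects.cfg: a lexical STWFSA project for short English texts (placeholders)
    [my-stwfsa-en]
    name=STWFSA English
    language=en
    backend=stwfsa
    vocab=my-vocab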

The NN ensemble backend has improved handling of source weights. Retraining NN ensemble models after updating Annif to this version is recommended, since the quality of results can decrease if old models are used. A new option has been added to several CLI commands: --docs-limit/-d limits the number of documents to process, for example to create learning-curve data. Several bugs have also been fixed.
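For example (the project id and corpus path are placeholders; exactly which commands accept the option is shown in their --help output):

    # evaluate on only the first 500 documents, e.g. for one point of a learning curve
    annif eval my-project /path/to/eval-docs/ --docs-limit 500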

New features:
#438 Lexical STWFSAPY Backend (credit @mo-fu)
#465 Limit document number CLI option

Improvements:
#457/#458 Improved handling of source weights in NN ensemble

Bug fixes:
#454/#455 Address SonarCloud complaints
#459/#460 Pass limit parameter to Maui Server during train
#463 Fix TruncatingCorpus iterator

Annif 0.50

07 Dec 11:19

This release introduces a setting for using only part of the input text for subject indexing: the new input_limit project parameter truncates the input text to the given number of characters. This can improve the quality of the suggestions, as the beginning of a long document typically includes an abstract and introduction. The default value of input_limit is zero, which means no truncation is performed.
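In projects.cfg the setting looks roughly like this (the section and backend names are placeholders):

    # use only the first 5000 characters of each document; 0 disables truncation
    [my-project]
    backend=fasttext
    input_limit=5000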

Improvements include better handling of cached data in nn_ensemble training and optimization of memory usage in evaluation by using sparse matrices for suggested subjects. Many dependencies have been updated and a few minor issues fixed.

New features:
#446 Add a backend parameter to limit input characters in suggest
#452 Apply the input_limit backend parameter to texts in train & learn

Improvements:
#441 Sparse subjects (credit @mo-fu)
#443/#444 Allow use of cached data after cancelled training of nn_ensemble backend

Maintenance:
#448 Upgrade dependencies
#445 Upgrade LMDB dependency from 0.98 to 1.0.0
#449 Resolve DeprecationWarning: change warn to warning

Bug fixes:
#447 Fix missing default params in pav and nn ensemble