Contributing to Annif

Contributions to Annif are very welcome!

This document aims to give you some helpful information when you wish to participate in Annif development.

Typically you contribute by opening a new issue or offering modifications to the codebase. For non-trivial modifications, it is usually best to discuss the topic in an issue before submitting a pull request (PR).

When creating an issue, whether it is for a feature request/proposal, a bug report or a question, you should first search the existing issues (both open and closed) for your topic. Feel free to comment on existing issues to offer new details, ideas, opinions etc.

However, if you have a general question about Annif or its usage in some specific scenario or with a particular dataset, please consider using the annif-users mailing list (Google Groups) instead of opening an issue.

Creating an issue

If you don't find an existing issue for your topic, open a new one. Please be clear in the title and description, and provide all necessary information. For a bug report, aim to include a minimal reproducible example of the problem.

For readability, please format code snippets as code and use other markdown formatting where appropriate.

Contributing code

If you see an issue that you'd like to fix, feel free to do so. If possible, let us know you're working on an issue by leaving a comment on it, so we can avoid doing the same work twice. This is especially useful if the issue has been marked for a release (in a milestone with a version number), since it's more likely that someone is already working on it.

Installation for development

See Development install in README.md or use the Docker image for development.

Development flow

The development of Annif follows GitHub flow. Feel free to fork the Annif repository for your changes. Some basic principles:

The main branch is always a working, deployable version of Annif. The code on the main branch will eventually be released as the next release.
All development happens on feature branches, whether branched from NatLibFi's origin or from a fork. Feature branches are normally named according to the issue they are addressing: e.g. issue267-cli-analyze-to-suggest which implements the change specified in issue #267.
Feature branches are merged via pull requests. Opening a pull request signals to the other developers that the feature is ready to be discussed and eventually merged. Pull requests should be marked with draft status if the developer knows that the code is not yet ready for merging but wants to start discussion. Also, various checks (tests in GitHub Actions, test coverage tools and static analyzer services) are run on pull requests, and these may provide important feedback to the developer.
The pull request should have a clear description of the included changes, and if the PR is modified later, the description should be updated. Include a linking keyword targeting an issue when applicable, so when the PR is merged, the issue is automatically closed.
Feature branches should be deleted after the pull request has been merged.
A new release is made whenever some important changes have landed in main. Releases are intended to be frequent. See Release process for the details of making a release.

Commits

Try to produce a commit history that is easy to follow, with meaningful commit messages. See commit best practices, e.g. here.

Branches

At any time, these branches typically exist:

the main branch
feature branches under development
experimental branches that are not under active development but which we don't want to delete in case the code is later needed

Unit tests

Generally, the aim is to cover every line of the codebase with unit tests. If you've added new functionality or found that the existing tests are lacking, we'd be happy if you could provide additional tests to cover it. The development dependencies include pytest, which you can execute in the project root to run the unit tests:

pytest

To run only a subset of tests, you can pass the path of a test file as an argument, e.g.: pytest tests/test_analyzer.py. flake8 checks are also run together with the unit tests. It is best to verify that the unit tests pass locally before pushing commits to the GitHub repository.
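As an illustration of what an additional test might look like, here is a minimal pytest-style sketch. The helper `split_keywords` is a hypothetical function invented for this example and is not part of Annif; in a real contribution the tests would import and exercise actual Annif code instead.

```python
from typing import List


# Hypothetical helper standing in for the code under test; not part of Annif.
def split_keywords(raw: str) -> List[str]:
    """Split a comma-separated string into cleaned, non-empty keywords."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


# pytest collects functions whose names start with "test_" and treats
# a failing plain assert statement as a test failure.
def test_split_keywords_strips_and_lowercases():
    assert split_keywords(" Cats , DOGS ") == ["cats", "dogs"]


def test_split_keywords_ignores_empty_parts():
    assert split_keywords("cats,, ,dogs") == ["cats", "dogs"]


def test_split_keywords_empty_input():
    assert split_keywords("") == []
```

Each test checks one behaviour and has a descriptive name, which makes a failure report easier to interpret.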

When a (draft) PR is opened or new commits are pushed to a branch belonging to a PR, the unit tests for the code are run in the GitHub Actions CI/CD pipeline. The tests are run on all the minor versions of Python that Annif aims to support, with varying configurations of the optional dependencies; see cicd.yaml for the pipeline setup.

Code style

Annif code should follow the Black style, and import statements should be grouped and ordered. To achieve this, the Black and isort tools are included as development dependencies; you can run black . and isort . in the project root to autoformat code. These tools, together with flake8, are also run in the GitHub Actions CI/CD pipeline to check code style compliance.

You can set up a pre-commit hook to automate linting with isort, Black and flake8 with every git commit by using the following in the file .git/hooks/pre-commit, which should have execute permission set:

#!/bin/bash

set -e

isort . --check-only --diff
black . --check --diff
flake8

If the hook complains and intercepts the commit, you can run isort . and/or black . for an automatic fix.

Alternatively, you can set a pre-commit hook to also autoformat code using the pre-commit framework, and configure it to use Black and isort.

Other points:

Names of identifiers in the code (variables, functions, classes etc.) should be meaningful. Do not use single-character names.
Write docstrings for the entities you create. They end up in Annif's internal API documentation.
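The two points above can be illustrated with a small sketch. The function `count_matching_subjects` and its parameters are hypothetical names chosen for this example, not existing Annif code.

```python
from typing import Iterable


def count_matching_subjects(
    suggested_subjects: Iterable[str], gold_subjects: Iterable[str]
) -> int:
    """Return how many suggested subjects also appear in the gold standard.

    Duplicates within either argument are counted only once.
    """
    # Descriptive names make the intent clear without extra comments;
    # compare with a version written as f(a, b) returning len(set(a) & set(b)).
    return len(set(suggested_subjects) & set(gold_subjects))
```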

Creating a new backend

Annif backend code is in the annif/backend module. Each backend is implemented as a subclass of AnnifBackend, or its more specific subclass AnnifLearningBackend (for backends that support online learning) or BaseEnsembleBackend (for backends that combine results from multiple projects).

A backend can define these key fields and methods:

name: field for a name for the backend (a single word, all lowercase)
initialize (optional): method setting up the necessary internal data structures
_train (optional): method for training the model on a given document corpus
_suggest: method for feeding a single document (text) and getting suggested subjects for it
_suggest_batch: method for feeding a batch of documents (texts) and getting suggested subjects for each of them

A backend needs to implement only one of _suggest and _suggest_batch, not both. Processing batches is often more efficient and should be used if possible.

Learning backends additionally implement:

_learn: method for continuing training the model on the given corpus
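The overall shape of a backend can be sketched as follows. This is a self-contained toy, not Annif's real API: `SketchBackend` is a hypothetical stand-in for AnnifBackend, and the method signatures, the corpus format (pairs of text and subject label) and the return type (lists of label and score pairs) are all assumptions made for illustration. Check the existing backends in annif/backend for the actual signatures before writing a real one.

```python
from typing import Dict, Iterable, List, Set, Tuple

Suggestion = Tuple[str, float]  # assumed shape: (subject label, score)


class SketchBackend:
    """Hypothetical stand-in for AnnifBackend; NOT Annif's real base class."""

    name = None

    def __init__(self) -> None:
        self.initialize()

    def initialize(self) -> None:
        """Set up internal data structures (optional hook)."""

    def suggest(self, texts: List[str]) -> List[List[Suggestion]]:
        """Public entry point: delegate to the batch hook."""
        return self._suggest_batch(texts)

    def _suggest(self, text: str) -> List[Suggestion]:
        raise NotImplementedError

    def _suggest_batch(self, texts: List[str]) -> List[List[Suggestion]]:
        # Fall back to the single-document hook, so a subclass only
        # has to implement one of the two methods.
        return [self._suggest(text) for text in texts]


class WordOverlapBackend(SketchBackend):
    """Toy backend scoring subjects by word overlap with training texts."""

    name = "wordoverlap"  # a single word, all lowercase

    def initialize(self) -> None:
        self._subject_words: Dict[str, Set[str]] = {}

    def _train(self, corpus: Iterable[Tuple[str, str]]) -> None:
        # Training starts from scratch, then reuses the learning step.
        self.initialize()
        self._learn(corpus)

    def _learn(self, corpus: Iterable[Tuple[str, str]]) -> None:
        # Continue training: extend the vocabulary without resetting it.
        for text, subject in corpus:
            words = self._subject_words.setdefault(subject, set())
            words.update(text.lower().split())

    def _suggest(self, text: str) -> List[Suggestion]:
        words = set(text.lower().split())
        if not words:
            return []
        scored = [
            (subject, len(words & vocabulary) / len(words))
            for subject, vocabulary in self._subject_words.items()
        ]
        matches = [item for item in scored if item[1] > 0]
        return sorted(matches, key=lambda item: item[1], reverse=True)


backend = WordOverlapBackend()
backend._train([("cats purr and meow", "cats"), ("dogs bark loudly", "dogs")])
print(backend.suggest(["cats meow", "dogs bark"]))
```

Here only _suggest is implemented and the stand-in base class derives the batch behaviour from it. A backend whose model can process many documents at once would override _suggest_batch instead.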
