gdex

Rule-based sentence scoring algorithm

This Python package provides a GDEX-based algorithm for evaluating how suitable sentences are as good examples in dictionaries. It assigns a numeric score between zero and one to sentences that have been preprocessed with the NLP library spaCy. The score takes several configurable criteria into account: knock-out criteria, which must all be fulfilled for a sentence to score above zero at all, and gradual criteria, which factor into the score of a sentence that passes the knock-out criteria.

Among the knock-out criteria are

  • the absence of invalid characters (i.e. control characters) in the sentence,
  • a properly parsed sentence with punctuation at the end, and
  • the existence of a finite verb and a subject, annotated and related in the sentence's dependency parse tree.
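
The first two knock-out checks can be illustrated without a parser. The following is a minimal sketch, not gdex's actual implementation: the function names and the set of sentence-final punctuation marks are assumptions made for illustration, and the third criterion (finite verb and subject) is left out because it needs a dependency parse.

```python
import unicodedata

# Assumed set of sentence-final marks; gdex works on spaCy annotations instead.
SENTENCE_FINAL = {".", "!", "?"}


def has_invalid_chars(text: str) -> bool:
    """True if the text contains a control character (Unicode category Cc)."""
    return any(unicodedata.category(ch) == "Cc" for ch in text)


def ends_with_punctuation(text: str) -> bool:
    """True if the last non-space character is sentence-final punctuation."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] in SENTENCE_FINAL


def passes_knockout(text: str) -> bool:
    """A sentence survives only if every knock-out check succeeds."""
    return not has_invalid_chars(text) and ends_with_punctuation(text)
```

A sentence that fails any single check is rejected outright, which mirrors the all-or-nothing nature of knock-out criteria described above.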

Among the gradual criteria are

  • the absence of blacklisted words (i.e. vulgar or obscene ones),
  • the absence of rare characters or those normally not available on a keyboard,
  • the absence of named entities,
  • the absence of deictic expressions,
  • an optimal length of the sentence, and
  • a whitelist-based coverage test that penalizes the use of rare lemmata.
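
How gradual criteria might combine into one score can be sketched as follows. This is an illustration under stated assumptions, not the formula gdex uses: the token-count thresholds, the plain averaging of factors, and all function names are hypothetical, and only two of the criteria listed above (sentence length and blacklisted words) are modelled.

```python
def length_factor(n_tokens: int, lo: int = 10, hi: int = 20, hard_max: int = 40) -> float:
    """1.0 inside the assumed optimal range [lo, hi], falling linearly to 0.0 outside."""
    if lo <= n_tokens <= hi:
        return 1.0
    if n_tokens < lo:
        return n_tokens / lo
    if n_tokens >= hard_max:
        return 0.0
    return (hard_max - n_tokens) / (hard_max - hi)


def blacklist_factor(tokens, blacklist) -> float:
    """Share of tokens that are not on the blacklist (case-insensitive)."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in blacklist)
    return 1.0 - hits / len(tokens)


def gradual_score(tokens, blacklist, passed_knockout: bool = True) -> float:
    """0.0 if a knock-out criterion failed, else the mean of the gradual factors."""
    if not passed_knockout:
        return 0.0
    factors = [length_factor(len(tokens)), blacklist_factor(tokens, blacklist)]
    return sum(factors) / len(factors)
```

Each factor lies between zero and one, so their mean does as well. A weighted combination would be an equally plausible design; the plain mean is chosen here only to keep the sketch short.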

Installation

gdex can be installed as a package from its GitHub source repository:

pip install git+https://github.com/zentrum-lexikographie/gdex.git

For development, clone it from GitHub and install it locally, including optional dependencies:

pip install -e ".[dev]"

Usage

>>> import spacy
>>> import gdex
>>> nlp = spacy.load("de_core_news_sm")
>>> [s._.gdex for s in gdex.de_core(nlp("Achtung! Das ist ein toller Test.")).sents]
[0.0, 0.5322]
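
Building on the scores shown above, a small helper can pick the best example sentences from a scored document. This is a sketch: `rank_examples` and its `min_score` parameter are hypothetical names and not part of gdex. It relies only on the `_.gdex` extension attribute shown above and on the `text` attribute of spaCy sentence spans.

```python
def rank_examples(doc, min_score=0.0):
    """Return (score, text) pairs for sentences scoring above min_score, best first."""
    scored = [(sent._.gdex, sent.text) for sent in doc.sents]
    kept = [pair for pair in scored if pair[0] > min_score]
    # Sort by score only, highest first; ties keep their document order.
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

Called as rank_examples(gdex.de_core(nlp("Achtung! Das ist ein toller Test."))), it would drop the first sentence, which scores 0.0 in the output above, and keep the second.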

Testing

Run tests, including calculation of code coverage:

coverage run -m pytest

Acknowledgements

This package was initially developed as part of the EVIDENCE project and funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, GU 798/27-1; GE 1119/11-1). Between August 2023 and October 2024, it was maintained by Ulf Hamster.

This implementation makes use of VulGer, a lexicon covering words from the lower end of the German language register, that is, terms typically considered rough, vulgar, or obscene. VulGer is used under the terms of the CC-BY-SA license.

Bibliography