Rule-based sentence scoring algorithm
This Python package provides a GDEX-based algorithm for evaluating how suitable sentences are as good examples in dictionaries. It assigns a numeric score between zero and one to sentences that have been preprocessed with the NLP tool spaCy. The score takes several configurable criteria into account: knock-out criteria, which must all be fulfilled for a sentence to score above zero at all, and gradual criteria, which determine how high a non-zero score turns out.
Among the knock-out criteria are
- the absence of invalid characters (i.e. control characters) in the sentence,
- properly parsed sentences with punctuation at the end, and
- the existence of a finite verb and a subject, annotated and related in a sentence's dependency parse tree.
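The first two knock-out criteria can be illustrated on raw strings. The following is a minimal sketch, not the package's implementation: the function names and the set of accepted final punctuation marks are assumptions made for illustration, and the third criterion (finite verb and subject) is left out because it needs a spaCy dependency parse.

```python
import unicodedata

# Hypothetical helpers for illustration; these names are not part of the gdex API.
SENTENCE_FINAL = (".", "!", "?")  # assumed set of valid final punctuation marks


def has_invalid_chars(text: str) -> bool:
    """Return True if the text contains a Unicode control character (category Cc)."""
    return any(unicodedata.category(ch) == "Cc" for ch in text)


def ends_with_punctuation(text: str) -> bool:
    """Return True if the last non-whitespace character is sentence-final punctuation."""
    return text.rstrip().endswith(SENTENCE_FINAL)


def passes_knockout(text: str) -> bool:
    """A sentence failing either check would receive a score of zero."""
    return not has_invalid_chars(text) and ends_with_punctuation(text)
```

A sentence for which `passes_knockout` returns `False` would be scored zero regardless of how well it does on the gradual criteria.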
Among the gradual criteria are
- the absence of blacklisted words (i.e. vulgar or obscene ones),
- the absence of rare characters or those normally not available on a keyboard,
- the absence of named entities,
- the absence of deictic expressions,
- an optimal length of the sentence, and
- a whitelist-based coverage test, i.e. a penalty for using rare lemmata.
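How knock-out and gradual criteria might combine into a single score can be sketched as follows. This is an illustration under stated assumptions, not the formula gdex uses: the length thresholds, the linear penalty, and the plain averaging of factors are all hypothetical choices, as are the function names.

```python
def length_factor(n_tokens: int, lo: int = 10, hi: int = 20, max_dist: int = 20) -> float:
    """Return 1.0 inside [lo, hi], falling linearly to 0.0 at max_dist tokens outside.

    The default thresholds are assumed values, not gdex defaults.
    """
    if lo <= n_tokens <= hi:
        return 1.0
    dist = lo - n_tokens if n_tokens < lo else n_tokens - hi
    return max(0.0, 1.0 - dist / max_dist)


def coverage_factor(lemmata, whitelist) -> float:
    """Return the share of lemmata that occur in the whitelist."""
    lemmata = list(lemmata)
    if not lemmata:
        return 0.0
    return sum(lemma in whitelist for lemma in lemmata) / len(lemmata)


def score(knockout_ok: bool, factors) -> float:
    """Return 0.0 if a knock-out criterion failed, else the mean of the gradual factors.

    Averaging is an assumption; a weighted sum or product would work similarly.
    """
    factors = list(factors)
    if not knockout_ok or not factors:
        return 0.0
    return sum(factors) / len(factors)
```

For example, a sentence of 15 tokens in which three of four lemmata are whitelisted would get `score(True, [1.0, 0.75])`, while any sentence failing a knock-out criterion gets `0.0`.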
gdex can be installed as a package from its GitHub source repository:
pip install git+https://github.com/zentrum-lexikographie/gdex.git
For development, clone it from GitHub and install it locally, including optional dependencies:
pip install -e .[dev]
>>> import spacy
>>> import gdex
>>> nlp = spacy.load("de_core_news_sm")
>>> [s._.gdex for s in gdex.de_core(nlp("Achtung! Das ist ein toller Test.")).sents]
[0.0, 0.5322]
Run tests, including calculation of code coverage:
coverage run -m pytest
This package was initially developed as part of the EVIDENCE project and funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, GU 798/27-1; GE 1119/11-1). Between August 2023 and October 2024, it was maintained by Ulf Hamster.
This implementation makes use of VulGer, a lexicon covering words from the lower end of the German language register, i.e. terms typically considered rough, vulgar, or obscene. VulGer is used under the terms of the CC-BY-SA license.
- Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell and Pavel Rychlý. GDEX: Automatically finding good dictionary examples in a corpus. In Proceedings of the 13th EURALEX International Congress. Spain, July 2008, pp. 425–432.
- Jörg Didakowski, Lothar Lemnitzer and Alexander Geyken. Automatic example sentence extraction for a contemporary German dictionary. In Proceedings of EURALEX. 2012.
- Elisabeth Eder, Ulrike Krieg-Holz, and Udo Hahn. 2019. At the Lower End of Language—Exploring the Vulgar and Obscene Side of German. In Proceedings of the Third Workshop on Abusive Language Online, pages 119–128, Florence, Italy. Association for Computational Linguistics.