SFST/SMOR/DWDS-based German morphology
This project provides a component for the lemmatisation and morphological analysis of word forms as well as for the generation of paradigms of lexical words in written German. To this end we adopt:
- SFST, a C++ library and toolbox for finite-state transducers (FSTs) (Schmid 2006)
- SMORLemma (Sennrich and Kunz 2014), a modified version of the Stuttgart Morphology (SMOR) (Schmid, Fitschen, and Heid 2004) with an alternative lemmatisation component
- the DWDS dictionary (BBAW n.d.) replacing the IMSLex (Fitschen 2004) as the lexical data source for German words, their grammatical categories, and their morphological properties.
This repository provides source code for building DWDSmor lexica and transducers as well as for using DWDSmor transducers for morphological analysis and paradigm generation:
- share/ contains XSLT stylesheets for extracting lexical entries in SMORLemma format from XML sources of DWDS articles. Sample inputs and outputs can be found in samples/.
- lexicon/dwds/ contains scripts for building DWDSmor lexica by means of the XSLT stylesheets in share/ and DWDS sources in lexicon/dwds/wb/, which are not part of this repository.
- lexicon/sample/ contains scripts for building sample DWDSmor lexica by means of the XSLT stylesheets in share/ and the sample lexicon in lexicon/sample/wb/.
- grammar/ contains an FST grammar derived from SMORLemma, providing the morphology for building DWDSmor automata from DWDSmor lexica.
- test/ implements a test suite for the DWDSmor transducers.
- dwdsmor.py and paradigm.py are user-level Python scripts for morphological analysis and for paradigm generation by means of DWDSmor transducers.
DWDSmor is in active development. In its current stage, DWDSmor supports most inflection classes and some productive word-formation patterns of written German. Note that the sample lexicon in lexicon/sample/wb/ only covers a small subset of the German vocabulary, and so do the DWDSmor automata compiled from it.
GNU/Linux : Development, builds, and tests of DWDSmor are performed on Debian GNU/Linux. While other UNIX-like operating systems such as macOS should work, too, they are not actively supported.
Python >= v3.9 : DWDSmor targets Python as its primary runtime environment. The DWDSmor transducers can be used via SFST's command-line tools, queried in Python applications via language-specific bindings, or used by the Python scripts dwdsmor.py and paradigm.py for morphological analysis and for paradigm generation.
Saxon-HE : The extraction of lexical entries from XML sources of DWDS articles is implemented in XSLT 2, for which Saxon-HE is used as the runtime environment.
Java (JDK) >= v8 : Saxon requires a Java runtime.
SFST : a C++ library and toolbox for finite-state transducers (FSTs); please take a look at its homepage for installation and usage instructions.
On a Debian-based distribution, install the following packages:
apt install python3 default-jdk libsaxonhe-java sfst
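Once the packages are installed, a quick check can confirm that the prerequisites listed above are reachable. The following is a minimal sketch and not part of DWDSmor. The executable names `java`, `fst-infl2`, and `make` are assumptions about what the packages put on the search path, and the helper names are hypothetical.

```python
import shutil
import sys

# Minimum Python version named in the requirements above.
MIN_PYTHON = (3, 9)

# Executable names are assumptions for illustration; adjust them to your system.
REQUIRED_TOOLS = ("java", "fst-infl2", "make")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the search path."""
    return [tool for tool in tools if which(tool) is None]


def python_is_recent_enough(version=None, minimum=MIN_PYTHON):
    """Check an interpreter version tuple against the minimum version."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= minimum


if not python_is_recent_enough():
    print("Python >= 3.9 is required")
for tool in missing_tools():
    print(f"missing: {tool}")
```

The `which` parameter is injectable only so that the lookup can be exercised without touching the real search path.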
Set up a virtual environment for project builds via Python's venv:
python3 -m venv .venv
source .venv/bin/activate
Then run the DWDSmor setup routine in order to install Python dependencies:
make setup
For building DWDSmor lexica and transducers, run:
make all
Alternatively, you can run:
make dwds && make dwds-install && make dwdsmor
Note that these commands require DWDS sources in lexicon/dwds/wb/, which are not part of this repository.
Alternatively, you can build sample DWDSmor lexica and transducers from the sample lexicon in lexicon/sample/wb/ by running:
make sample && make sample-install && make dwdsmor
After building DWDSmor transducers, install them into lib/, where the user-level Python scripts dwdsmor.py and paradigm.py expect them by default:
make install
The installed DWDSmor transducers are:
- lib/dwdsmor.{a,ca}: transducer with inflection and word-formation components, for lemmatisation and morphological analysis of word forms in terms of grammatical categories
- lib/dwdsmor-morph.{a,ca}: transducer with inflection and word-formation components, for the generation of morphologically segmented word forms
- lib/dwdsmor-finite.{a,ca}: transducer with an inflection component and a finite word-formation component, for testing purposes
- lib/dwdsmor-root.{a,ca}: transducer with inflection and word-formation components, for lexical analysis of word forms in terms of root lemmas (i.e., lemmas of ultimate word-formation bases), word-formation process, word-formation means, and grammatical categories in terms of the Pattern-and-Restriction Theory of word formation (Nolda 2022)
- lib/dwdsmor-index.{a,ca}: transducer with an inflection component only and with DWDS homographic lemma indices, for paradigm generation
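To verify an installation, one could check that every file named in the list above is present in lib/. This is a minimal sketch under the assumption that each transducer is installed in both the standard (.a) and the compact (.ca) format. The function names are hypothetical and not part of DWDSmor.

```python
from pathlib import Path

# Base names and suffixes taken from the list of installed transducers above.
TRANSDUCERS = (
    "dwdsmor",
    "dwdsmor-morph",
    "dwdsmor-finite",
    "dwdsmor-root",
    "dwdsmor-index",
)
SUFFIXES = (".a", ".ca")


def expected_files(lib_dir="lib"):
    """List every transducer file that the installation should provide."""
    return [
        Path(lib_dir) / f"{name}{suffix}"
        for name in TRANSDUCERS
        for suffix in SUFFIXES
    ]


def missing_files(lib_dir="lib"):
    """Return the expected transducer files that do not exist in lib_dir."""
    return [path for path in expected_files(lib_dir) if not path.is_file()]
```

Calling `missing_files()` from the repository root after `make install` should then return an empty list if all ten files were installed.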
The installed DWDSmor transducers can be examined with the test suite in test/. It provides coverage tests and regression tests.
The coverage tests are run with the following command:
make test-coverage
Coverage test reports and statistics are saved as TSV tables in test/reports/ and test/summaries/, respectively.
Individual coverage tests can be run by calling test/Makefile as follows:
make -C test test-dwds-lemma-coverage
make -C test test-sample-lemma-coverage
make -C test test-tuebadz-lemma-coverage
make -C test test-dwds-paradigm-coverage
make -C test test-sample-paradigm-coverage
The test-dwds-lemma-coverage and test-dwds-paradigm-coverage targets of test/Makefile require DWDS sources in lexicon/dwds/wb/ (not part of this repository). The test-tuebadz-lemma-coverage target presupposes a TüBa-D/Z treebank export tuebadz-11.0-exportXML-v2.xml at test/data/tuebadz/ (likewise not part of this repository).
Note that runs of the test-dwds-paradigm-coverage and test-sample-paradigm-coverage targets of test/Makefile may take a considerable amount of time.
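Since some coverage targets depend on data outside the repository, a small helper can select only the targets whose data is present. The sketch below is not part of DWDSmor: it hard-codes the dependencies described above and treats the mere existence of a path as sufficient, which is an assumption. The function names are hypothetical.

```python
from pathlib import Path

# Coverage targets of test/Makefile and the external data each one needs,
# as described above; None marks targets that need no external data.
COVERAGE_TARGETS = {
    "test-dwds-lemma-coverage": "lexicon/dwds/wb",
    "test-sample-lemma-coverage": None,
    "test-tuebadz-lemma-coverage": "test/data/tuebadz/tuebadz-11.0-exportXML-v2.xml",
    "test-dwds-paradigm-coverage": "lexicon/dwds/wb",
    "test-sample-paradigm-coverage": None,
}


def runnable_targets(root="."):
    """Return the coverage targets whose required data exists under root."""
    root = Path(root)
    return [
        target
        for target, required in COVERAGE_TARGETS.items()
        if required is None or (root / required).exists()
    ]


def make_commands(root="."):
    """Build one `make -C test <target>` command line per runnable target."""
    return [["make", "-C", "test", target] for target in runnable_targets(root)]
```

Each command list can be passed to `subprocess.run(..., check=True)` from the repository root.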
Regression tests compare generated test results to saved snapshots in test/reports/. To create the snapshots, first run:
make test-snapshot
Then, in order to test for regressions which may arise from changes of lexicon, grammar, or user-level scripts, run:
make test-regression
Regression test targets can also be run individually by calling test/Makefile as follows:
make -C test test-analysis-snapshot
make -C test test-paradigm-snapshot
make -C test test-analysis-regression
make -C test test-paradigm-regression
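The idea behind snapshot-based regression testing can be illustrated with a few lines of Python. This is a generic sketch using the standard library's difflib, not the implementation used by the DWDSmor test suite. The function names and the `snapshot/` and `current/` labels are hypothetical.

```python
import difflib


def regression_diff(snapshot_text, current_text, name="report.tsv"):
    """Return unified-diff lines between a saved snapshot and a new result."""
    return list(
        difflib.unified_diff(
            snapshot_text.splitlines(),
            current_text.splitlines(),
            fromfile=f"snapshot/{name}",
            tofile=f"current/{name}",
            lineterm="",
        )
    )


def has_regression(snapshot_text, current_text):
    """Treat any difference between snapshot and current result as a regression."""
    return bool(regression_diff(snapshot_text, current_text))
```

An empty diff means that changes to the lexicon, grammar, or scripts did not alter the tested output.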
DWDSmor provides two Python scripts for using the DWDSmor transducers.
dwdsmor.py is a Python script for the lemmatisation and morphological analysis of word forms in written German by means of a DWDSmor transducer:
$ ./dwdsmor.py -h
usage: dwdsmor.py [-h] [-a] [-c] [-C] [-E] [-H] [-i] [-I] [-j] [-m] [-M] [-P] [-s] [-S]
[-t TRANSDUCER] [-T TRANSDUCER2] [-v] [-w] [-W] [-y] [input] [output]
positional arguments:
input input file (one word form per line; default: stdin)
output output file (default: stdout)
options:
-h, --help show this help message and exit
-a, --analysis-string
output also analysis string
-c, --csv output CSV table
-C, --force-color preserve color and formatting when piping output
-E, --no-empty suppress empty columns or values
-H, --no-header suppress table header
-i, --lemma-index output also homographic lemma index
-I, --paradigm-index output also paradigm index
-j, --json output JSON object
-m, --minimal prefer lemmas with minimal number of boundaries
-M, --maximal prefer word forms with maximal number of boundaries (requires supplementary transducer file)
-P, --plain suppress color and formatting
-s, --seg-lemma output also segmented lemma
-S, --seg-word output also segmented word form (requires supplementary transducer file)
-t TRANSDUCER, --transducer TRANSDUCER
path to transducer file in compact format (default: lib/dwdsmor.ca)
-T TRANSDUCER2, --transducer2 TRANSDUCER2
path to supplementary transducer file in standard format (default: lib/dwdsmor-morph.a)
-v, --version show program's version number and exit
-w, --wf-process output also word-formation process
-W, --wf-means output also word-formation means
-y, --yaml output YAML document
By default, dwdsmor.py prints a TSV table on standard output:
$ printf 'Ihr\nkönnt\neuch\nauf\nden\nKinderbänken\nausruhen\n.\n' | ./dwdsmor.py -E
Wordform Lemma POS Subcategory Person Gender Case Number Inflection Function Nonfinite Mood Tense Metalinguistic Characters
Ihr Ihre POSS Neut Acc Sg NoInfl Attr
Ihr Ihre POSS Neut Nom Sg NoInfl Attr
Ihr Ihre POSS Masc Nom Sg NoInfl Attr
Ihr ihr PPRO Pers 2 Nom Pl CAP
Ihr ihre POSS Neut Acc Sg NoInfl Attr CAP
Ihr ihre POSS Neut Nom Sg NoInfl Attr CAP
Ihr ihre POSS Masc Nom Sg NoInfl Attr CAP
Ihr Sie PPRO Pers 3 NoGend Gen Pl Old
Ihr sie PPRO Pers 3 NoGend Gen Pl Old CAP
Ihr sie PPRO Pers 3 Fem Dat Sg CAP
Ihr sie PPRO Pers 3 Fem Gen Sg Old CAP
könnt können V 2 Pl Ind Pres
euch euch PPRO Refl 2 Acc Pl
euch euch PPRO Refl 2 Dat Pl
euch ihr PPRO Pers 2 Acc Pl
euch ihr PPRO Pers 2 Dat Pl
auf auf ADV
auf auf PREP
den die REL Masc Acc Sg St Subst
den die DEM Masc Acc Sg St Subst
den die DEM NoGend Dat Pl St Attr
den die DEM Masc Acc Sg St Attr
den die ART Def Masc Acc Sg St Subst
den die ART Def NoGend Dat Pl St Attr
den die ART Def Masc Acc Sg St Attr
Kinderbänken Kinderbank NN Fem Dat Pl
ausruhen ausruhen V Inf
ausruhen ausruhen V 3 Pl Subj Pres
ausruhen ausruhen V 3 Pl Ind Pres
ausruhen ausruhen V 1 Pl Subj Pres
ausruhen ausruhen V 1 Pl Ind Pres
. . PUNCT Period
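For downstream processing, the TSV table can be read with Python's csv module. The sketch below groups analyses by word form. It assumes tab-separated columns and a header row containing a `Wordform` column, as in the output above. The sample text is a hypothetical excerpt with a reduced set of columns, and `parse_analyses` is not part of DWDSmor.

```python
import csv
import io
from collections import defaultdict


def parse_analyses(tsv_text):
    """Group the rows of a TSV table of analyses by word form."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    analyses = defaultdict(list)
    for row in reader:
        form = row["Wordform"]
        # Keep only the categories that are actually specified in this row.
        analysis = {
            key: value
            for key, value in row.items()
            if key and value and key != "Wordform"
        }
        analyses[form].append(analysis)
    return dict(analyses)


# Hypothetical excerpt with the columns reduced for illustration.
SAMPLE = (
    "Wordform\tLemma\tPOS\tGender\tCase\tNumber\n"
    "Kinderbänken\tKinderbank\tNN\tFem\tDat\tPl\n"
    "auf\tauf\tADV\t\t\t\n"
    "auf\tauf\tPREP\t\t\t\n"
)

for form, rows in parse_analyses(SAMPLE).items():
    print(form, [row["POS"] for row in rows])
```

In practice the text would come from the standard output of dwdsmor.py, for example captured with `subprocess.run`.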
The transducer can be selected as an argument of option -t:
$ printf 'Ihr\nkönnt\neuch\nauf\nden\nKinderbänken\nausruhen\n.\n' | ./dwdsmor.py -E -t lib/dwdsmor-root.ca
Wordform Lemma POS Subcategory Person Gender Case Number Inflection Function Nonfinite Mood Tense Metalinguistic Characters
Ihr Ihre POSS Neut Acc Sg NoInfl Attr
Ihr Ihre POSS Neut Nom Sg NoInfl Attr
Ihr Ihre POSS Masc Nom Sg NoInfl Attr
Ihr ihr PPRO Pers 2 Nom Pl CAP
Ihr ihre POSS Neut Acc Sg NoInfl Attr CAP
Ihr ihre POSS Neut Nom Sg NoInfl Attr CAP
Ihr ihre POSS Masc Nom Sg NoInfl Attr CAP
Ihr Sie PPRO Pers 3 NoGend Gen Pl Old
Ihr sie PPRO Pers 3 NoGend Gen Pl Old CAP
Ihr sie PPRO Pers 3 Fem Dat Sg CAP
Ihr sie PPRO Pers 3 Fem Gen Sg Old CAP
könnt können V 2 Pl Ind Pres
euch euch PPRO Refl 2 Acc Pl
euch euch PPRO Refl 2 Dat Pl
euch ihr PPRO Pers 2 Acc Pl
euch ihr PPRO Pers 2 Dat Pl
auf auf ADV
auf auf PREP
den die REL Masc Acc Sg St Subst
den die DEM Masc Acc Sg St Subst
den die DEM NoGend Dat Pl St Attr
den die DEM Masc Acc Sg St Attr
den die ART Def Masc Acc Sg St Subst
den die ART Def NoGend Dat Pl St Attr
den die ART Def Masc Acc Sg St Attr
Kinderbänken Kind + Bank NN Fem Dat Pl
ausruhen ruhen V Inf
ausruhen ruhen V 3 Pl Subj Pres
ausruhen ruhen V 3 Pl Ind Pres
ausruhen ruhen V 1 Pl Subj Pres
ausruhen ruhen V 1 Pl Ind Pres
ausruhen ausruhen V Inf
ausruhen ausruhen V 3 Pl Subj Pres
ausruhen ausruhen V 3 Pl Ind Pres
ausruhen ausruhen V 1 Pl Subj Pres
ausruhen ausruhen V 1 Pl Ind Pres
. . PUNCT Period
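With the root transducer, the lemma column can hold several bases, as in "Kind + Bank" above. A minimal sketch for splitting such root lemmas follows. The separator " + " is inferred from the sample output and may not cover every case. The function names are hypothetical.

```python
def root_bases(root_lemma, separator=" + "):
    """Split a root lemma such as 'Kind + Bank' into its bases."""
    return [base.strip() for base in root_lemma.split(separator)]


def is_complex(root_lemma):
    """Tell whether a root lemma consists of more than one base."""
    return len(root_bases(root_lemma)) > 1


print(root_bases("Kind + Bank"))
print(is_complex("ruhen"))
```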
CSV, JSON, and YAML outputs are available with options -c, -j, and -y, respectively.
paradigm.py is a Python script for the generation of paradigms of lexical words in written German by means of a DWDSmor transducer:
$ ./paradigm.py -h
usage: paradigm.py [-h] [-c] [-C] [-E] [-H] [-i {1,2,3,4,5,6,7,8}] [-I {1,2,3,4,5,6,7,8}] [-j] [-n] [-N]
[-o] [-O] [-p {ADJ,ART,CARD,DEM,FRAC,INDEF,NN,NPROP,ORD,POSS,PPRO,REL,V,WPRO}]
[-P] [-s] [-S] [-t TRANSDUCER] [-u] [-v] [-y] lemma [output]
positional arguments:
lemma lemma (determiners: Fem Nom Sg; nominalised
adjectives: Wk)
output output file (default: stdout)
options:
-h, --help show this help message and exit
-c, --csv output CSV table
-C, --force-color preserve color and formatting when piping output
-E, --no-empty suppress empty columns or values
-H, --no-header suppress table header
-i {1,2,3,4,5,6,7,8}, --lemma-index {1,2,3,4,5,6,7,8}
homographic lemma index
-I {1,2,3,4,5,6,7,8}, --paradigm-index {1,2,3,4,5,6,7,8}
paradigm index
-j, --json output JSON object
-n, --no-cats do not output category names
-N, --no-lemma do not output lemma, lemma index, paradigm index, and lexical categories
-o, --old output also archaic forms
-O, --oldorth output also forms in old spelling
-p {ADJ,ART,CARD,DEM,FRAC,INDEF,NN,NPROP,ORD,POSS,PPRO,REL,V,WPRO}, --pos {ADJ,ART,CARD,DEM,FRAC,INDEF,NN,NPROP,ORD,POSS,PPRO,REL,V,WPRO}
part of speech
-P, --plain suppress color and formatting
-s, --nonst output also non-standard forms
-S, --ch output also forms in Swiss spelling
-t TRANSDUCER, --transducer TRANSDUCER
path to transducer file in standard format (default: lib/dwdsmor-index.a)
-u, --user-specified use only user-specified information
-v, --version show program's version number and exit
-y, --yaml output YAML document
By default, paradigm.py outputs a TSV table similar to that of dwdsmor.py:
$ ./paradigm.py -E Bank
Lemma Lemma Index POS Gender Case Number Paradigm Forms
Bank 1 NN Fem Nom Sg Bank
Bank 1 NN Fem Acc Sg Bank
Bank 1 NN Fem Dat Sg Bank
Bank 1 NN Fem Gen Sg Bank
Bank 1 NN Fem Nom Pl Bänke
Bank 1 NN Fem Acc Pl Bänke
Bank 1 NN Fem Dat Pl Bänken
Bank 1 NN Fem Gen Pl Bänke
Bank 2 NN Fem Nom Sg Bank
Bank 2 NN Fem Acc Sg Bank
Bank 2 NN Fem Dat Sg Bank
Bank 2 NN Fem Gen Sg Bank
Bank 2 NN Fem Nom Pl Banken
Bank 2 NN Fem Acc Pl Banken
Bank 2 NN Fem Dat Pl Banken
Bank 2 NN Fem Gen Pl Banken
For a condensed version, the options -n and -N can be specified. The DWDS homographic lemma index can be selected with option -i:
$ ./paradigm.py -n -N -i 1 Bank
Paradigm Categories Paradigm Forms
Nom Sg Bank
Acc Sg Bank
Dat Sg Bank
Gen Sg Bank
Nom Pl Bänke
Acc Pl Bänke
Dat Pl Bänken
Gen Pl Bänke
The default transducer for paradigm generation is dwdsmor-index.a, which is restricted to inflection. Paradigms for word-formation products which are unavailable in the DWDS can be generated with the transducer dwdsmor.a:
$ ./paradigm.py -n -N -t lib/dwdsmor.a Kinderbank
Paradigm Categories Paradigm Forms
Nom Sg Kinderbank
Acc Sg Kinderbank
Dat Sg Kinderbank
Gen Sg Kinderbank
Nom Pl Kinderbanken, Kinderbänke
Acc Pl Kinderbanken, Kinderbänke
Dat Pl Kinderbanken, Kinderbänken
Gen Pl Kinderbanken, Kinderbänke
Note that this transducer does not know of DWDS homographic lemma indices.
Again, options -c, -j, and -y select alternative CSV, JSON, and YAML outputs.
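The condensed paradigm table lends itself to a simple mapping from categories to forms. The sketch below assumes two tab-separated columns, space-separated category values, and comma-separated alternative forms, as suggested by the output for Kinderbank above. `parse_paradigm` is a hypothetical helper, not part of DWDSmor.

```python
import csv
import io


def parse_paradigm(tsv_text):
    """Map paradigm categories to lists of forms from a condensed table."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    next(reader, None)  # skip the header row
    paradigm = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        categories, forms = row[0], row[1]
        paradigm[tuple(categories.split())] = [
            form.strip() for form in forms.split(",")
        ]
    return paradigm


# Excerpt of the condensed output shown above, assuming tab-separated columns.
SAMPLE = (
    "Paradigm Categories\tParadigm Forms\n"
    "Nom Sg\tKinderbank\n"
    "Dat Pl\tKinderbanken, Kinderbänken\n"
)

print(parse_paradigm(SAMPLE)[("Dat", "Pl")])
```

Using category tuples as keys keeps lookups such as `("Dat", "Pl")` independent of the order of rows in the table.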
Feel free to contact Andreas Nolda for questions regarding the lexicon or the grammar and Gregor Middell for questions related to the integration of DWDSmor into your corpus-annotation pipeline.
- Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.). DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart. https://www.dwds.de
- Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes System. Ph.D. thesis, Universität Stuttgart.
- Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on compounding and blending in German. In Headedness and/or Grammatical Anarchy?, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press, 343–376.
- Schmid, Helmut (2006). A programming language for finite state transducers. In Finite-State Methods and Natural Language Processing: 5th International Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005, ed. by Anssi Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial Intelligence 4002, Berlin: Springer, 308–309.
- Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German computational morphology covering derivation, composition, and inflection. In LREC 2004: Fourth International Conference on Language Resources and Evaluation, ed. by Maria T. Lino et al., European Language Resources Association, 1263–1266.
- Sennrich, Rico and Beat Kunz (2014). Zmorge: A German morphological lexicon extracted from Wiktionary. In LREC 2014: Ninth International Conference on Language Resources and Evaluation, ed. by Nicoletta Calzolari et al., European Language Resources Association, 1063–1067.
Like the original SMOR and SMORLemma grammars, the DWDSmor grammar is licensed under the GNU General Public License v2.0. The rest of this project is licensed under the GNU Lesser General Public License v3.0.
Andreas Nolda [email protected]