CharSplit - An ngram-based compound splitter for German

Splits a German compound into its body and head, e.g.

Autobahnraststätte -> Autobahn - Raststätte

Implementation of the method described in the appendix of the thesis:

Tuggener, Don (2016). Incremental Coreference Resolution for German. University of Zurich, Faculty of Arts.

TL;DR: The method calculates probabilities of ngrams occurring at the beginning, end and in the middle of words and identifies the most likely position for a split.
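The idea can be illustrated with a minimal toy sketch. This is not the code in this repository: the function names (`train`, `score_split`, `best_split`), the fixed ngram length of 3, and the simple "end + start - middle" score are all assumptions made for illustration, and the thesis defines the actual scoring.

```python
from collections import Counter

N = 3  # ngram length used throughout this toy example (an assumption)


def train(words, n=N):
    """Count ngrams by position: word-initial, word-final, word-internal, overall."""
    start, end, middle, total = Counter(), Counter(), Counter(), Counter()
    for word in words:
        w = word.lower()
        last = len(w) - n  # index of the final ngram in this word
        for i in range(last + 1):
            gram = w[i:i + n]
            total[gram] += 1
            if i == 0:
                start[gram] += 1
            if i == last:
                end[gram] += 1
            if 0 < i < last:
                middle[gram] += 1
    return start, end, middle, total


def _prob(counter, total, gram):
    """Relative frequency of a position for this ngram; 0.0 if the ngram is unseen."""
    return counter[gram] / total[gram] if total[gram] else 0.0


def score_split(word, pos, model, n=N):
    """Hypothetical score: reward word-final/word-initial ngrams around pos,
    penalise an ngram spanning pos that usually occurs word-internally."""
    start, end, middle, total = model
    w = word.lower()
    left_tail = w[pos - n:pos]    # last n characters of the candidate body
    right_head = w[pos:pos + n]   # first n characters of the candidate head
    offset = pos - n // 2
    spanning = w[offset:offset + n]  # ngram that straddles the split point
    return (_prob(end, total, left_tail)
            + _prob(start, total, right_head)
            - _prob(middle, total, spanning))


def best_split(word, model, n=N):
    """Return (body, head) for the highest-scoring position, or None if too short."""
    candidates = range(n, len(word) - n + 1)
    if not candidates:
        return None
    pos = max(candidates, key=lambda p: score_split(word, p, model, n))
    return word[:pos], word[pos:].capitalize()


model = train(["Auto", "Bahn", "Rast", "Haus", "Boot"])
print(best_split("Hausboot", model))  # ('Haus', 'Boot')
```

With only five training words the counts are trivial; the provided model relies on far more data, which is what makes the position probabilities informative.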

The method achieves ~95% accuracy for head detection on the GermaNet compound test set.

A pretrained model is provided, trained on 10 million German nouns from newspaper text.

Usage

Train a new model:

python char_split_train.py <your_train_file>

where <your_train_file> contains one word (noun) per line.
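A training file in this format could be produced with a small helper like the one below. The helper name `write_train_file` and the choice of UTF-8 encoding are assumptions, not part of this repository. Duplicates are deliberately kept, on the assumption that word frequency should influence the ngram counts.

```python
from pathlib import Path


def write_train_file(nouns, path):
    """Write one noun per line (UTF-8), skipping blank entries.

    Returns the number of lines written.
    """
    lines = [noun.strip() for noun in nouns if noun.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

The resulting file can then be passed to `char_split_train.py` as `<your_train_file>`.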

Compound splitting

From the command line:

python char_split.py <word>

This outputs all possible splits, ranked by their score, e.g.

python char_split.py Autobahnraststätte
0.84096566854	Autobahn	Raststätte
-0.54568851959	Auto	Bahnraststätte
-0.719082070993	Autobahnrast	Stätte
...

As a module:

>>> import char_split
>>> char_split.split_compound('Autobahnraststätte')
[[0.8409656685402584, u'Autobahn', u'Rastst\xe4tte'], [-0.5456885195896692, u'Auto', u'Bahnrastst\xe4tte'], [-0.719082070992539, u'Autobahnrast', u'St\xe4tte'], ...]
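Since the result is a list of `[score, body, head]` entries, the top-ranked split can be selected with a small helper. The sketch below is hypothetical: `pick_split` and its `min_score` threshold of 0.0 are assumptions, not part of the module, and the sample data is limited to the three entries shown above.

```python
def pick_split(splits, min_score=0.0):
    """Return (body, head) of the best split, or None if no split reaches min_score."""
    if not splits:
        return None
    score, body, head = max(splits, key=lambda s: s[0])
    if score < min_score:
        return None  # treat the word as a non-compound (assumed convention)
    return body, head


# The three entries shown in the output above.
example = [
    [0.8409656685402584, "Autobahn", "Raststätte"],
    [-0.5456885195896692, "Auto", "Bahnraststätte"],
    [-0.719082070992539, "Autobahnrast", "Stätte"],
]
print(pick_split(example))  # ('Autobahn', 'Raststätte')
```

A suitable threshold depends on the data; 0.0 is only a starting point for experimentation.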
