Harmonization of tokenization #10

dan-zeman · 2015-08-14T11:24:25Z

HamleDT has traditionally respected the tokenization of the original treebank. Now that the Universal Dependencies annotation style formulates some core axioms about tokenization, we should start normalizing tokenization as well.

Many treebanks contain multi-word nodes in which surface words are joined by the underscore character (or, less frequently, by whitespace). We should split these multi-word nodes into multiple normal nodes and use some heuristics to find the head and to define the relations between the new nodes.
Catalan and Spanish (AnCora) contain empty nodes for dropped subjects. Most of the time these nodes are leaves (check with PML-TQ). We should remove them from the UD-style trees. (Note that this does not apply to the NULL nodes in Hindi, which cover ellipsis, or to the "_" nodes in Turkish, which reflect morphological derivation.)
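
To make the two proposed steps concrete, here is a rough Python sketch, not the HamleDT/Treex implementation. Everything in it is an assumption made for illustration: the `Node` class, the `empty` flag for dropped-subject nodes, the "first part is the head" heuristic, and the `mwe` label for the relation between the parts. The issue itself leaves the head-finding heuristics open.

```python
from dataclasses import dataclass


@dataclass
class Node:
    id: int              # 1-based position in the sentence
    form: str
    head: int            # id of the parent node, 0 for the root
    deprel: str
    empty: bool = False  # hypothetical flag for a dropped-subject node


def remove_empty_leaves(nodes):
    """Drop empty nodes that have no children, then renumber the rest."""
    parents = {n.head for n in nodes}
    kept = [n for n in nodes if not (n.empty and n.id not in parents)]
    new_id_of = {0: 0}
    for new_id, n in enumerate(kept, start=1):
        new_id_of[n.id] = new_id
    # Removed nodes are leaves, so every remaining head is still in the map.
    return [Node(new_id_of[n.id], n.form, new_id_of[n.head], n.deprel, n.empty)
            for n in kept]


def split_multiword_nodes(nodes, sep="_", inner_deprel="mwe"):
    """Split forms such as 'Buenos_Aires' into one node per surface word.

    Assumed heuristic: the first part becomes the head and inherits the
    original attachment; the other parts attach to it with inner_deprel.
    """
    new_id_of = {0: 0}
    parts_of = {}
    next_id = 1
    for n in nodes:
        parts = [p for p in n.form.split(sep) if p]
        if not parts:            # form is only separators, e.g. a bare "_"
            parts = [n.form]     # leave such nodes untouched
        parts_of[n.id] = parts
        new_id_of[n.id] = next_id
        next_id += len(parts)

    result = []
    for n in nodes:
        first_id = new_id_of[n.id]
        parts = parts_of[n.id]
        result.append(Node(first_id, parts[0], new_id_of[n.head], n.deprel, n.empty))
        for offset, part in enumerate(parts[1:], start=1):
            result.append(Node(first_id + offset, part, first_id, inner_deprel))
    return result


# Toy sentence with made-up relation labels.
sentence = [
    Node(1, "_", 2, "subj", empty=True),
    Node(2, "Vive", 0, "root"),
    Node(3, "en", 2, "prep"),
    Node(4, "Buenos_Aires", 3, "obj"),
]
harmonized = split_multiword_nodes(remove_empty_leaves(sentence))
print([n.form for n in harmonized])  # ['Vive', 'en', 'Buenos', 'Aires']
```

Passing `sep=" "` would handle the treebanks that join words with whitespace. A node whose form is just `_` is never split, and only nodes explicitly flagged as `empty` are removed, so the Hindi NULL nodes and Turkish "_" nodes would be left alone as long as they are not flagged.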

dan-zeman added the enhancement label Aug 14, 2015
