HamleDT traditionally respected the tokenization of the original treebank. With the Universal Dependencies annotation style, which formulates some core axioms about tokenization, we should start normalizing it as well.
Many treebanks contain multi-word nodes in which the surface words are joined by an underscore (or, less frequently, by whitespace). We should split these multi-word expression (MWE) nodes into multiple ordinary nodes and use heuristics to choose the head and to define the relations between the new nodes.
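A minimal sketch of the splitting step, not HamleDT code. The `Node` class, the function name `split_mwe_nodes`, the "first word is the head" heuristic and the `mwe` label for the inner relations are all assumptions made for illustration; the real heuristics would probably have to depend on the language and on the part of speech.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical minimal node: 1-based position, word form, parent position, relation."""
    ord: int
    form: str
    head: int  # ord of the parent; 0 means the artificial root
    deprel: str


def split_mwe_nodes(nodes, sep="_", inner_deprel="mwe"):
    """Split every node whose form contains `sep` into one node per surface word.

    Assumed heuristic: the first word becomes the head of the group and
    inherits the original attachment; the remaining words attach to it.
    """
    # First pass: split the forms and work out where each original node
    # will start once the extra words have been inserted.
    new_ord = {0: 0}
    parts_of = {}
    next_ord = 1
    for node in nodes:
        # A form consisting only of separators (e.g. "_") is left unsplit.
        parts = [p for p in node.form.split(sep) if p] or [node.form]
        parts_of[node.ord] = parts
        new_ord[node.ord] = next_ord
        next_ord += len(parts)

    # Second pass: emit the new nodes with renumbered positions and heads.
    result = []
    for node in nodes:
        first = new_ord[node.ord]
        parts = parts_of[node.ord]
        result.append(Node(first, parts[0], new_ord[node.head], node.deprel))
        for offset, part in enumerate(parts[1:], start=1):
            result.append(Node(first + offset, part, first, inner_deprel))
    return result
```

Children of the original multi-word node end up attached to the first word, because every old position is mapped to the position of the first part. Passing `sep=" "` covers the treebanks that use whitespace instead of the underscore.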
Catalan and Spanish (AnCora) contain empty nodes for dropped subjects. Most of the time these nodes are leaves (check with PML-TQ). We should remove them from the UD-style trees. (Note that this does not apply to the NULL nodes in Hindi, which cover ellipsis, or to the "_" nodes in Turkish, which reflect morphological derivation.)
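The removal could look roughly like the sketch below. It is an illustration only: the `Node` class and `remove_empty_leaves` are hypothetical names, and how an empty node is recognized is left to a caller-supplied predicate because that depends on the treebank. Empty nodes that still have children are kept, so that no subtree is orphaned; those cases would need a separate decision.

```python
from dataclasses import dataclass, replace


@dataclass
class Node:
    """Hypothetical minimal node: 1-based position, word form, parent position, relation."""
    ord: int
    form: str
    head: int  # ord of the parent; 0 means the artificial root
    deprel: str


def remove_empty_leaves(nodes, is_empty):
    """Drop nodes for which `is_empty(node)` is true and which have no children.

    The remaining nodes are renumbered so that positions stay contiguous.
    """
    # Every position that occurs as somebody's head has at least one child.
    positions_with_children = {node.head for node in nodes}
    kept = [
        node
        for node in nodes
        if not (is_empty(node) and node.ord not in positions_with_children)
    ]
    # Map old positions to new ones; only leaves were removed, so every
    # head referenced by a kept node is itself kept (or is the root, 0).
    renumber = {0: 0}
    for new_ord, node in enumerate(kept, start=1):
        renumber[node.ord] = new_ord
    return [
        replace(node, ord=renumber[node.ord], head=renumber[node.head])
        for node in kept
    ]
```

This is a single pass: an empty node whose only children were themselves empty leaves survives the first call and would be removed by calling the function again.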