Harmonization of tokenization #10

Open
dan-zeman opened this issue Aug 14, 2015 · 0 comments

HamleDT has traditionally respected the tokenization of the original treebanks. Now that we target the Universal Dependencies annotation style, which formulates some core axioms about tokenization, we should start normalizing tokenization as well.

  1. Many treebanks contain multi-word nodes where the surface words are joined by the underscore character (or, less frequently, by whitespace). We should split these multi-word nodes into multiple normal nodes and use some heuristics to find the head and to define the relations between the new nodes.
  2. Catalan and Spanish (Ancora) contain empty nodes for dropped subjects. Most of the time these nodes are leaves (check with PML-TQ). We should remove them from the UD-style trees. (Note that this does not apply to the NULL nodes in Hindi, which cover ellipsis, or to the "_" nodes in Turkish, which reflect morphological derivation.)
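
To make the two steps concrete, here is a rough sketch in Python. It is not HamleDT or Treex code: the `Node` class, both function names, the head-first heuristic, the `mwe` placeholder label and the `<EMPTY>` marker are all assumptions made for illustration only. The real head-finding heuristics and the way empty nodes are recognized would have to be decided per treebank.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical minimal node; not the Treex node API."""
    id: int       # 1-based position in the sentence
    form: str     # surface form
    head: int     # id of the parent node, 0 for the artificial root
    deprel: str   # dependency relation label


def split_multiword_nodes(nodes: List[Node], sep: str = "_",
                          inner_deprel: str = "mwe") -> List[Node]:
    """Split forms such as 'Nueva_York' into one node per surface word.

    Assumed heuristic: the first part becomes the head and keeps the
    original attachment; the remaining parts attach to it with
    `inner_deprel` (a placeholder label).
    """
    old_to_new = {0: 0}
    pieces = []
    next_id = 1
    for node in nodes:
        parts = node.form.split(sep)
        # Leave forms like a bare "_" (or "_x") untouched.
        if len(parts) < 2 or not all(parts):
            parts = [node.form]
        old_to_new[node.id] = next_id  # the first part represents the old node
        pieces.append((node, parts, next_id))
        next_id += len(parts)

    result = []
    for node, parts, first_id in pieces:
        result.append(Node(first_id, parts[0], old_to_new[node.head], node.deprel))
        for offset, part in enumerate(parts[1:], start=1):
            result.append(Node(first_id + offset, part, first_id, inner_deprel))
    return result


def remove_empty_leaves(nodes: List[Node],
                        is_empty: Callable[[Node], bool]) -> List[Node]:
    """Drop nodes flagged by `is_empty` that have no children, then renumber.

    Empty nodes that still have children are kept, so no kept node can
    lose its head.
    """
    heads = {n.head for n in nodes}
    keep = [n for n in nodes if not (is_empty(n) and n.id not in heads)]
    old_to_new = {0: 0}
    for new_id, n in enumerate(keep, start=1):
        old_to_new[n.id] = new_id
    return [Node(old_to_new[n.id], n.form, old_to_new[n.head], n.deprel)
            for n in keep]


# Toy sentence; "<EMPTY>" is a made-up marker for a dropped subject.
sentence = [
    Node(1, "<EMPTY>", 2, "nsubj"),
    Node(2, "vive", 0, "root"),
    Node(3, "en", 4, "case"),
    Node(4, "Nueva_York", 2, "nmod"),
]
split = split_multiword_nodes(sentence)
cleaned = remove_empty_leaves(split, lambda n: n.form == "<EMPTY>")
```

The predicate passed to `remove_empty_leaves` is deliberately left to the caller, so that the Hindi NULL nodes and the Turkish "_" nodes can simply be excluded from it. Non-leaf empty nodes are left in place here; they would need separate handling.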