Common ontology for part of speech #43

Open
3 tasks
humenda opened this issue Nov 4, 2018 · 2 comments

Comments

@humenda
Member

humenda commented Nov 4, 2018

Some dictionaries already have some kind of local ontology to reliably identify
part of speech (and potentially gender, etc.). Examples are the WikDict
dictionaries or eng-pol. Most other dictionaries lack this information; there,
the <pos/> tag may contain arbitrary text. For machine-friendly
postprocessing, this should be mapped to an ontology that is valid for all
FreeDict dictionaries.
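
To make the mapping idea concrete, here is a minimal sketch in Python. It assumes TEI-encoded entries in the TEI namespace; the lookup table `POS_MAP`, the target tag names, and the helper names `normalize_pos` and `unmapped_labels` are all hypothetical and only illustrate the approach, since no common ontology has been chosen yet.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical lookup table: free-text <pos> labels -> one common tag.
# The target tags are placeholders for whatever ontology is chosen.
POS_MAP = {
    "n": "NOUN", "noun": "NOUN", "subst": "NOUN",
    "v": "VERB", "verb": "VERB",
    "adj": "ADJ", "adjective": "ADJ",
    "adv": "ADV", "adverb": "ADV",
}

def normalize_pos(label):
    """Return the common tag for a free-text label, or None if unknown."""
    key = label.strip().lower().rstrip(".")
    return POS_MAP.get(key)

def unmapped_labels(tei_xml):
    """Collect <pos> values in a TEI string that the table cannot map."""
    root = ET.fromstring(tei_xml)
    missing = set()
    for pos in root.iterfind(".//tei:pos", TEI_NS):
        text = pos.text or ""
        if normalize_pos(text) is None:
            missing.add(text.strip())
    return missing

# Invented sample entries, only for demonstration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<entry><form><orth>house</orth></form><gramGrp><pos>n</pos></gramGrp></entry>
<entry><form><orth>quickly</orth></form><gramGrp><pos>Adv.</pos></gramGrp></entry>
<entry><form><orth>ouch</orth></form><gramGrp><pos>interj</pos></gramGrp></entry>
</body></text></TEI>"""

if __name__ == "__main__":
    print(unmapped_labels(SAMPLE))
```

Reporting the labels that cannot be mapped, rather than silently dropping them, would give a to-do list per dictionary when converting existing ones.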

Things to happen:

  • provide common ontology
  • mention in documentation that newly imported / created dictionaries need to use the ontology
  • convert existing dictionaries

@bansp
Member

bansp commented Jan 4, 2021

I'm sorry I missed this note.
Providing a common taxonomy / ontology even for the current set of databases is a formidable task, and at its core a mostly linguistic one.

It would be much more practical to use an existing taxonomy. Back when I created the tagUsage mechanism for aggregating grammatical information, ISOCat was probably the hype (not an ontology, just a messy set of potentially orderly taxonomic groupings). But ISOCat is gone now, replaced by a proprietary engine aiming at something slightly different than our goals.

Another viable option back then was the so-called GOLD ontology, created on the basis of a single comprehensive linguistic monograph, with (as far as I can recall, and this may be a false recollection) additions from various indigenous languages, coming from field workers. GOLD is not very alive nowadays, I'm afraid.

Somewhere along the way was/is the OLiA ontology, whose main mover is still very alive and kicking, so this could be worth exploring.

Or, something that has just come to my mind and need not be the best solution for our goals: the so-called universal tagset used by Universal Dependencies. The idealized picture would be to use each (non-universal) language-specific UD tagset and provide the (UD-supplied) mapping to the universal tagset.
I can imagine two problems with that:

  1. there is often no single language-specific tagset for the particular language in the UD approach; this is because UD datasets come from numerous corpora, and each of those corpora tends to use its own tagset (sometimes standardized at the, say, 'national level', like STTS for German or CLAWS for English; except note that CLAWS comes in several variants, and many corpora of English do not use CLAWS :-)).
  2. dictionary makers will only extremely rarely follow a corpus-based tagset, which would mean an extra step of aligning the PoS labels from the given dictionary with the PoS labels from the given corpus tagset.
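
The two-step route and the second problem can be sketched roughly as follows, in Python. The dictionary labels and the table `DICT_TO_STTS` are invented for illustration; the STTS-to-universal pairs in `STTS_TO_UPOS` are a small subset that I believe matches the usual conversion, but they should be checked against the UD-supplied tables before any real use.

```python
# Step 1 (hypothetical): align labels found in a German dictionary with
# STTS tags. This is the extra manual step described in point 2.
DICT_TO_STTS = {
    "Substantiv": "NN",
    "Eigenname": "NE",
    "Adverb": "ADV",
}

# Step 2: language-specific tag -> universal tag (subset, to be verified).
STTS_TO_UPOS = {
    "NN": "NOUN",
    "NE": "PROPN",
    "ADJA": "ADJ",
    "ADV": "ADV",
}

def to_universal(label):
    """Chain both tables; return None when either step has no entry."""
    stts = DICT_TO_STTS.get(label)
    if stts is None:
        return None
    return STTS_TO_UPOS.get(stts)

if __name__ == "__main__":
    for label in ("Substantiv", "Eigenname", "Adjektiv"):
        print(label, "->", to_universal(label))
```

In this sketch "Adjektiv" stays unmapped even though the second table knows an adjective tag, because nobody has aligned that dictionary label with the corpus tagset yet.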

Well, then... OLiA might be the only viable solution, currently.

@humenda
Member Author

humenda commented Jan 9, 2021 via email


2 participants