Rule based tagger verbose setting #15

Open
apmoore1 opened this issue Dec 8, 2021 · 0 comments

apmoore1 commented Dec 8, 2021

A potential enhancement to the rule-based taggers, both the spaCy and non-spaCy versions, could be a verbose setting: whenever a token is tagged, it would also receive a second tag naming the rule that produced its USAS tags. To support this, we can add a label to each of the tagger rules listed below, e.g. the first rule would be labelled R1, the second R2, and so on. When tagging in verbose mode, each token would then carry one of these rule labels alongside its USAS tags; an example is shown below the rules. What do you think, @perayson? It could make the tagger more explainable and easier for users to debug.

Tagger Rules

  1. If pos_mapper is not None, map the POS from the POS model
    to the first POS value in the List from the pos_mapper Dict. If the
    pos_mapper cannot map the POS from the POS model, go to step 9.
  2. If POS==punc, label as PUNCT.
  3. Lookup token and POS tag.
  4. Lookup lemma and POS tag.
  5. Lookup lower case token and POS tag.
  6. Lookup lower case lemma and POS tag.
  7. If POS==num, label as N1.
  8. If there is another POS value in the pos_mapper, go back to step 2
    with this new POS value, else carry on to step 9.
  9. Lookup token with any POS tag and choose first entry in lexicon.
  10. Lookup lemma with any POS tag and choose first entry in lexicon.
  11. Lookup lower case token with any POS tag and choose first entry in lexicon.
  12. Lookup lower case lemma with any POS tag and choose first entry in lexicon.
  13. Label as Z99, this is the unmatched semantic tag.
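
To make the labelling concrete, here is a minimal sketch of how the rule cascade above could return a rule label with each result. It is not the PyMUSAS implementation: `tag_token_verbose` and the two plain-dict lexicons are hypothetical stand-ins, and the sketch assumes the POS-free lexicon already stores the first matching entry for each text.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-ins for the tagger's lexicons (not the PyMUSAS API):
# one keyed by (text, pos), one keyed by text alone.
PosLexicon = Dict[Tuple[str, str], List[str]]
TextLexicon = Dict[str, List[str]]


def tag_token_verbose(
    token: str,
    lemma: str,
    pos: str,
    pos_lexicon: PosLexicon,
    text_lexicon: TextLexicon,
    pos_mapper: Optional[Dict[str, List[str]]] = None,
) -> Tuple[List[str], str]:
    """Return (usas_tags, rule_label) for a single token."""
    # Rule 1: map the POS model's tag to candidate lexicon POS values.
    # An unmappable POS gives no candidates, so we fall through to rule 9.
    if pos_mapper is None:
        candidates = [pos]
    else:
        candidates = pos_mapper.get(pos, [])

    # Rule 8 is this loop: rules 2-7 are retried for each candidate POS.
    for candidate in candidates:
        if candidate == "punc":
            return ["PUNCT"], "R2"
        pos_lookups = [
            ("R3", token),
            ("R4", lemma),
            ("R5", token.lower()),
            ("R6", lemma.lower()),
        ]
        for label, text in pos_lookups:
            tags = pos_lexicon.get((text, candidate))
            if tags is not None:
                return tags, label
        if candidate == "num":
            return ["N1"], "R7"

    # Rules 9-12: the same four lookups, ignoring the POS tag.
    any_pos_lookups = [
        ("R9", token),
        ("R10", lemma),
        ("R11", token.lower()),
        ("R12", lemma.lower()),
    ]
    for label, text in any_pos_lookups:
        tags = text_lexicon.get(text)
        if tags is not None:
            return tags, label

    # Rule 13: nothing matched.
    return ["Z99"], "R13"
```

Returning the label from the same place that returns the tags keeps the two in sync, so a label can never name a rule other than the one that actually fired.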

Example

This is an example of what it could output if we went ahead with this idea:

from pymusas.lexicon_collection import LexiconCollection
from pymusas.taggers.rule_based import USASRuleBasedTagger
welsh_lexicon_url = 'https://raw.githubusercontent.com/apmoore1/Multilingual-USAS/master/Welsh/semantic_lexicon_cy.tsv'
lexicon_lookup = LexiconCollection.from_tsv(welsh_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(welsh_lexicon_url, include_pos=False)
tagger = USASRuleBasedTagger(lexicon_lookup, lemma_lexicon_lookup)
# Proposed behaviour: verbose=True would also return the label of the rule that fired.
output = tagger.tag_token(('[', '[', 'punc'), verbose=True)
usas_tags, rule = output
assert usas_tags == ['PUNCT']
# The second rule above (R2) fires because the token is a punctuation symbol.
assert rule == 'R2'
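
As a rough illustration of the debugging use case, the rule labels could be counted across a text to see how often each rule fires, for instance how many tokens fall through to the unmatched rule. The `rule_histogram` helper and the sample outputs below are hypothetical; they only assume the `(usas_tags, rule)` output shape proposed in the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rule_histogram(outputs: Iterable[Tuple[List[str], str]]) -> Counter:
    """Count how often each rule label produced a token's tags."""
    return Counter(rule for _tags, rule in outputs)


# Made-up verbose outputs in the proposed (usas_tags, rule) shape.
outputs = [
    (["PUNCT"], "R2"),
    (["Z99"], "R13"),
    (["N1"], "R7"),
    (["Z99"], "R13"),
]
histogram = rule_histogram(outputs)
# Share of tokens that no lexicon lookup could match (rule 13).
unmatched_share = histogram["R13"] / len(outputs)
```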