entities vs. entitymentions in API #22

Open
AbeHandler opened this issue Jun 14, 2015 · 6 comments

@AbeHandler
Collaborator

Hi @brendano -- I'm trying to find cases where named entities co-refer within a document. For instance: "The Orleans Parish School Board is set to consider the proposal from the teacher's union. OPSB rejected a similar proposal at last month's meeting."

This seems fairly cumbersome w/ the current API. Each ['sentence'] has an ['entitymentions'] list whose entries carry a ['tokspan'] attribute. The 'tokspan' counts tokens by position in the document, so if sentence 1 is 10 tokens, then sentence 2 might have a tokspan of 11-13 for some named entity.

```json
"entitymentions": [
  {"charspan": [52, 61], "sentence": 1, "normalized": "THIS P1W OFFSET P1W",
   "type": "DATE", "tokspan": [10, 12], "timex_xml": "<TIMEX3 ...</TIMEX3>"},
  {"charspan": [129, 139], "type": "MISC", "tokspan": [24, 25], "sentence": 1},
  {"charspan": [226, 238], "type": "PERSON", "tokspan": [40, 42], "sentence": 1}
],
```

So far so good, but the ['entities'] counter goes sentence by sentence, giving each mention a tokspan relative to its own sentence.

```json
{"mentions": {"head": 2, "animacy": "ANIMATE", "sentence": 11,
              "gender": "UNKNOWN", "mentionid": 84, "mentiontype": "PROPER",
              "number": "SINGULAR", "tokspan_in_sentence": [2, 3]}}
```

This is workaroundable... but I wonder if the wrapper might be improved by changing the way tokspan is calculated for a given ['sentence'] -- or, alternatively, adding a ['tokspan_in_sentence'] to each mention in a sentence's 'entitymentions'. In my opinion, it makes sense to give a tokspan that is limited to a given sentence, within a given ['sentence'] object.

If that change to the API would break everything that uses this wrapper, then maybe it's not worth it. But it does seem sort of confusing to fresh eyes.

See what I am getting at? Happy to work around or fork if you don't feel like mucking with it.
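A minimal sketch of the conversion being discussed, assuming each sentence dict carries a 'tokens' list and that a mention's 'sentence' field is a 0-based index into the sentence list (both assumptions, not confirmed by the thread):

```python
def tokspan_in_sentence(doc, mention):
    """Turn a document-relative 'tokspan' into a sentence-relative one."""
    # Tokens contributed by all sentences before the mention's sentence.
    offset = sum(len(s["tokens"]) for s in doc["sentences"][:mention["sentence"]])
    start, end = mention["tokspan"]
    return [start - offset, end - offset]

# Toy document: sentence 0 has 10 tokens, so sentence 1's tokens start at 10.
doc = {"sentences": [{"tokens": ["w"] * 10}, {"tokens": ["w"] * 8}]}
mention = {"sentence": 1, "tokspan": [11, 13]}
print(tokspan_in_sentence(doc, mention))  # [1, 3]
```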

@brendano
Owner

Yeah, it would be great to have sentence-relative token positions, perhaps as well as doc-relative positions. The deep internals of CoreNLP like to use doc-relative ones as far as I could figure out, but maybe I didn't look hard enough. Take a look if you want. Also check their XML output code (which my code is based on) to see if they have it somewhere.


@brendano
Owner

Or to put it another way: yes, the inconsistency is really lame. If you can wrestle something with more consistency out of CoreNLP, go for it! This software is just a little layer on top of it, and I often have a hard time figuring out their stuff.

@AbeHandler
Collaborator Author

I often have a hard time figuring out their stuff too. This layer is a good idea -- I've wasted a bunch of time knitting Stanford components together for one-off projects in Java. I will poke around and see if there's a way to give per-sentence and per-document offsets in the JSON.

@AbeHandler
Collaborator Author

Hrm. So I guess the question is whether this change should go in the Python portion of stanford_corenlp_pywrapper or in the Java portion.

I have not actually run the Java through a debugger, but based on reading the code it seems like the token numbers are coming straight out of the CoreNLP pipeline.

https://github.com/brendano/stanford_corenlp_pywrapper/blob/master/stanford_corenlp_pywrapper/javasrc/corenlp/JsonPipeline.java#L302

If that is in fact how it works, then making the change in the Python seems way, way easier, as it does not require digging into the source for the CoreNLP tools, which would be a huge pain. So maybe some kind of 'post-processing' method hereabouts that cleans up the output from CoreNLP? https://github.com/brendano/stanford_corenlp_pywrapper/blob/master/stanford_corenlp_pywrapper/sockwrap.py#L200
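A sketch of what such a post-processing pass might look like, operating on one parsed document dict of the shape shown earlier. The function name is hypothetical, as are the assumptions that sentences appear in order, that each carries 'tokens' and 'entitymentions' lists, and that 'sentence' indices are 0-based:

```python
def add_sentence_relative_spans(doc):
    """Annotate each entity mention with a 'tokspan_in_sentence' field."""
    # Cumulative token offset at the start of each sentence.
    offsets, total = [], 0
    for sent in doc["sentences"]:
        offsets.append(total)
        total += len(sent["tokens"])
    # Rewrite each mention's doc-relative span into a sentence-relative one.
    for sent in doc["sentences"]:
        for m in sent.get("entitymentions", []):
            off = offsets[m["sentence"]]
            start, end = m["tokspan"]
            m["tokspan_in_sentence"] = [start - off, end - off]
    return doc

doc = {"sentences": [
    {"tokens": ["w"] * 10, "entitymentions": []},
    {"tokens": ["w"] * 8,
     "entitymentions": [{"sentence": 1, "tokspan": [11, 13]}]},
]}
add_sentence_relative_spans(doc)
print(doc["sentences"][1]["entitymentions"][0]["tokspan_in_sentence"])  # [1, 3]
```

Because it only adds a field rather than changing 'tokspan', a pass like this would not break existing users of the wrapper.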

@brendano
Owner

I think you want to look at addEntityMentions() in the Java code (I found it by searching for "entitymentions", the key name in the JSON output); see also the CoreNLP webpage, which lists all the annotations they have.

Post-processing is dangerous because it's a maintainability burden: what if their code changes, or the assumptions behind the post-processing are wrong? On the other hand, if you have to do it anyway, might as well write it into this layer.


@AbeHandler
Collaborator Author

Yah. I see your point on post-processing. I will dig through the CoreNLP stuff, which I should know in the first place.
