entities vs. entitymentions in API #22

Open
AbeHandler opened this issue Jun 14, 2015 · 6 comments

@AbeHandler
Collaborator

Hi @brendano -- I'm trying to find cases where named entities co-refer within a document. For instance: "The Orleans Parish School Board is set to consider the proposal from the teacher's union. OPSB rejected a similar proposal at last month's meeting."

This seems fairly cumbersome w/ the current API. Each ['sentence'] has an ['entitymentions'] list whose entries carry a ['tokspan'] attribute. The 'tokspan' counts tokens by position in the document, so if sentence 1 is 10 tokens, then sentence 2 might have a tokspan of 11-13 for some named entity.

```json
"entitymentions": [
  {"charspan": [52, 61], "sentence": 1, "normalized": "THIS P1W OFFSET P1W",
   "type": "DATE", "tokspan": [10, 12], "timex_xml": "<TIMEX3 ...</TIMEX3>"},
  {"charspan": [129, 139], "type": "MISC", "tokspan": [24, 25], "sentence": 1},
  {"charspan": [226, 238], "type": "PERSON", "tokspan": [40, 42], "sentence": 1}
],
```

So far so good, but the ['entities'] counter goes sentence by sentence, giving each mention a tokspan relative to its own sentence.

```json
{"mentions": {"head": 2, "animacy": "ANIMATE", "sentence": 11,
              "gender": "UNKNOWN", "mentionid": 84, "mentiontype": "PROPER",
              "number": "SINGULAR", "tokspan_in_sentence": [2, 3]}}
```

This is workaroundable... but I wonder if the wrapper might be improved by changing the way tokspan is calculated for a given ['sentence'] -- or, alternatively, adding a ['tokspan_in_sentence'] to each mention in a sentence's 'entitymentions'. In my opinion, it makes sense to give a tokspan that is limited to a given sentence, within a given ['sentence'] object.

If that change to the API would break everything that uses this wrapper, then maybe it's not worth it. But it does seem sort of confusing to fresh eyes.

See what I am getting at? Happy to work around or fork if you don't feel like mucking with it.
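A minimal sketch of the conversion being discussed, assuming each sentence dict carries a 'tokens' list and that a mention's 'sentence' field is a 0-based index into the sentence list (both assumptions, not confirmed by the thread):

```python
def tokspan_in_sentence(doc, mention):
    """Turn a document-relative 'tokspan' into a sentence-relative one."""
    # Tokens contributed by all sentences before the mention's sentence.
    offset = sum(len(s["tokens"]) for s in doc["sentences"][:mention["sentence"]])
    start, end = mention["tokspan"]
    return [start - offset, end - offset]

# Toy document: sentence 0 has 10 tokens, so sentence 1's tokens start at 10.
doc = {"sentences": [{"tokens": ["w"] * 10}, {"tokens": ["w"] * 8}]}
mention = {"sentence": 1, "tokspan": [11, 13]}
print(tokspan_in_sentence(doc, mention))  # [1, 3]
```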

@brendano
Owner

Yeah, it would be great to have sentence-relative token positions, perhaps as well as doc-relative positions. The deep internals of CoreNLP like to use doc-relative ones as far as I could figure out, but maybe I didn't look hard enough. Take a look if you want. Also check their XML output code (which my code is based on) to see if they have it somewhere.


@brendano
Owner

Or to put it another way: yes, the inconsistency is really lame. If you can wrestle something with more consistency out of CoreNLP, go for it! This software is just a little layer on top of it, and I often have a hard time figuring out their stuff.

@AbeHandler
Collaborator Author

I often have a hard time figuring out their stuff too. This layer is a good idea -- I've wasted a bunch of time knitting Stanford components together for one-off projects in Java. I will poke around and see if there's a way to give per-sentence and per-document offsets in the JSON.

@AbeHandler
Collaborator Author

Hrm. So I guess the question is whether this change should go in the Python portion of stanford_corenlp_pywrapper or in the Java portion.

I have not actually run the Java through a debugger, but based on reading the code it seems like the token numbers are coming straight out of the CoreNLP pipeline.

https://github.com/brendano/stanford_corenlp_pywrapper/blob/master/stanford_corenlp_pywrapper/javasrc/corenlp/JsonPipeline.java#L302

If that is in fact how it works, then making the change in the Python seems way, way easier, as it does not require digging into the source for the CoreNLP tools, which would be a huge pain. So maybe some kind of 'post-processing' method hereabouts that cleans up the output from CoreNLP? https://github.com/brendano/stanford_corenlp_pywrapper/blob/master/stanford_corenlp_pywrapper/sockwrap.py#L200
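A sketch of what such a post-processing pass might look like, operating on one parsed document dict of the shape shown earlier. The function name is hypothetical, as are the assumptions that sentences appear in order, that each carries 'tokens' and 'entitymentions' lists, and that 'sentence' indices are 0-based:

```python
def add_sentence_relative_spans(doc):
    """Annotate each entity mention with a 'tokspan_in_sentence' field."""
    # Cumulative token offset at the start of each sentence.
    offsets, total = [], 0
    for sent in doc["sentences"]:
        offsets.append(total)
        total += len(sent["tokens"])
    # Rewrite each mention's doc-relative span into a sentence-relative one.
    for sent in doc["sentences"]:
        for m in sent.get("entitymentions", []):
            off = offsets[m["sentence"]]
            start, end = m["tokspan"]
            m["tokspan_in_sentence"] = [start - off, end - off]
    return doc

doc = {"sentences": [
    {"tokens": ["w"] * 10, "entitymentions": []},
    {"tokens": ["w"] * 8,
     "entitymentions": [{"sentence": 1, "tokspan": [11, 13]}]},
]}
add_sentence_relative_spans(doc)
print(doc["sentences"][1]["entitymentions"][0]["tokspan_in_sentence"])  # [1, 3]
```

Because it only adds a field rather than changing 'tokspan', a pass like this would not break existing users of the wrapper.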

@brendano
Owner

I think you want to look at addEntityMentions() in the Java code (I found it by searching for "entitymentions", the key name in the JSON output); see also the CoreNLP webpage, which lists all the annotations they have.

Post-processing is dangerous because it's a maintainability burden: what if their code changes, or the assumptions behind the post-processing are wrong? On the other hand, if you have to do it anyway, might as well write it into this layer.


@AbeHandler
Collaborator Author

Yah. I see your point on post-processing. I will dig through the CoreNLP stuff, which I should know in the first place.
