entities vs. entitymentions in API #22
Comments
yeah it would be great to have sentence-relative token positions, perhaps On Sunday, June 14, 2015, Abe Handler [email protected] wrote:
-brendan [mobile] |
or to put it another way: yes, the inconsistency is really lame. if you can wrestle something more consistent out of corenlp, go for it! this software is just a little layer on top of it and i often have a hard time figuring out their stuff |
I often have a hard time figuring out their stuff too. This layer is a good idea -- I've wasted a bunch of time knitting Stanford components together for one-off projects in Java. I will poke around and see if there is a way to give per-sentence and per-document offsets in the JSON |
Hrm. So I guess the question is whether this change should go in the Python portion of stanford_corenlp_pywrapper or in the Java portion. I have not actually run the Java through a debugger, but based on reading the code it seems like the token numbers are coming straight out of the CoreNLP pipeline. If that is in fact how it works, then making the change in the Python seems way, way easier, as it does not require digging into the source for the CoreNLP tools, which would be a huge pain. So maybe some kind of 'post processing' method hereabouts that cleans up the output from CoreNLP? https://github.com/brendano/stanford_corenlp_pywrapper/blob/master/stanford_corenlp_pywrapper/sockwrap.py#L200 |
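For illustration only, a post-processing step along these lines might look like the sketch below. It is not code from the wrapper. It assumes each sentence dict carries a 'tokens' list next to its 'entitymentions', and that 'tokspan' is a zero-based [start, end) pair counted from the start of the document; the function name and the 'tokspan_in_sentence' key are hypothetical (the key name is the one suggested in the issue text).

```python
def add_sentence_relative_spans(sentences):
    """Add a 'tokspan_in_sentence' entry to every entity mention, in place.

    Assumes (not confirmed by the wrapper) that each sentence dict has a
    'tokens' list and that mention['tokspan'] is a zero-based, half-open
    [start, end) pair counted from the start of the document.
    """
    offset = 0  # document-level index of the current sentence's first token
    for sent in sentences:
        for mention in sent.get("entitymentions", []):
            start, end = mention["tokspan"]
            mention["tokspan_in_sentence"] = [start - offset, end - offset]
        offset += len(sent["tokens"])
    return sentences


# Toy document: sentence 1 has 10 tokens, sentence 2 has 9.
doc = [
    {"tokens": ["tok"] * 10, "entitymentions": [{"tokspan": [1, 5]}]},
    {"tokens": ["tok"] * 9, "entitymentions": [{"tokspan": [10, 11]}]},
]
add_sentence_relative_spans(doc)
```

The original 'tokspan' is left untouched, so anything already relying on document-level positions keeps working.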
i think you want to look at addEntityMentions() in the java code (i found postproc is dangerous bcs it's a maintainability burden. what if their On Mon, Jun 15, 2015 at 12:17 AM, Abe Handler [email protected]
Yah. I see your point on post processing. I will dig through the corenlp stuff, which I should know in the first place. |
Hi @brendano -- I'm trying to find cases where named entities co-refer within a document. For instance: "The Orleans Parish School Board is set to consider the proposal from the teacher's union. OPSB rejected a similar proposal at last month's meeting."
This seems fairly cumbersome w/ the current API. Each ['sentence'] has an ['entitymentions'] with a ['tokspan'] attribute. The 'tokspan' counts based on the position in the document. So if sentence 1 is 10 tokens, then sentence 2 might have a tokspan 11-13 for some named entity.
So far so good, but the ['entities'] counter goes sentence by sentence, giving a tokspan for each mention w/in a sentence.
This is workaroundable... but I wonder if the wrapper might be improved by changing the way tokspan is calculated for a given ['sentence'] -- or, alternately, adding a ['tokspan_in_sentence'] to each mention in a sentence's 'entitymentions'. In my opinion, it makes sense to give a tokspan that is limited to a given sentence, within a given ['sentence'] object.
If that change to the API will break everything that uses this wrapper, then maybe it's not worth it. But it does seem sort of confusing to fresh eyes.
See what I am getting at? Happy to work around or fork if you don't feel like mucking with it.
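As a rough sketch of the workaround on the caller's side: sentence-relative spans can be lifted to document-level spans so the two kinds of tokspan can be compared directly. The helper names here are hypothetical, and the sketch assumes zero-based, half-open spans and a 'tokens' list on each sentence, neither of which is confirmed above.

```python
def sentence_offsets(sentences):
    """Return the document-level index of the first token of each sentence."""
    offsets, total = [], 0
    for sent in sentences:
        offsets.append(total)
        total += len(sent["tokens"])
    return offsets


def to_document_span(sent_index, span, offsets):
    """Convert a sentence-relative [start, end) span to a document-level one."""
    start, end = span
    base = offsets[sent_index]
    return (base + start, base + end)


# Toy document: sentence 1 has 10 tokens, sentence 2 has 9.
sentences = [{"tokens": ["tok"] * 10}, {"tokens": ["tok"] * 9}]
offsets = sentence_offsets(sentences)
# A mention covering the first token of sentence 2, in document coordinates.
doc_span = to_document_span(1, (0, 1), offsets)
```

With both kinds of span in the same coordinate system, matching a mention in 'entities' against one in 'entitymentions' reduces to comparing tuples.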