
Evaluate function not right #6
Open
airkid opened this issue Mar 1, 2019 · 8 comments

Comments

@airkid

airkid commented Mar 1, 2019

https://github.com/lx865712528/JMEE/blob/494451d5852ba724d273ee6f97602c60a5517446/enet/testing.py#L72
If I add the following assertion just before this line:
assert len(arguments) == len(arguments_)
it raises an AssertionError.
I believe this is because arguments contains the gold arguments while arguments_ contains only the predicted arguments, whose length changes dynamically during training.
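For illustration, a minimal sketch of the mismatch (the (start, end, role_id) triples below are made up, not real JMEE output):

arguments  = [(2, 4, 7)]                  # gold arguments for one sentence
arguments_ = [(0, 1, 3), (2, 4, 7)]       # predicted arguments
assert len(arguments) == len(arguments_)  # AssertionError: 1 != 2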

@DorianKodelja

DorianKodelja commented Mar 1, 2019

This computes the score incorrectly: if the model predicts a wrong entity before all the correct ones, the predictions are misaligned with the gold annotations and the score drops to 0, as shown in this example:
gold roles are [(3,5,11),(7,9,9)]
predicted roles are [(0,2,2),(3,5,11),(7,9,9)]
first iteration: compare (3,5,11) and (0,2,2) -> fail
second iteration: compare (7,9,9) and (3,5,11) -> fail, even though (3,5,11) is in the gold annotations.
Here is a functioning version that also generates a per-class report (it requires tabulate):

calculate_sets_1.txt
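To make the misalignment concrete, here is a minimal sketch (not the attached file; it just mimics the role check item[2] == item_[2] from testing.py) comparing the zip-based loop with an order-independent count on the example above:

gold  = [(3, 5, 11), (7, 9, 9)]
preds = [(0, 2, 2), (3, 5, 11), (7, 9, 9)]

# zip pairs by position, so every comparison fails once the lists shift
ct_zip = sum(1 for g, p in zip(gold, preds) if g[2] == p[2])
print(ct_zip)  # 0, even though both gold arguments were predicted

# order-independent matching: count predictions that appear among the gold tuples
ct_set = len(set(gold) & set(preds))
print(ct_set)  # 2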

@mikelkl

mikelkl commented Mar 7, 2019

Hi @airkid @DorianKodelja, I got with conclusion with you, according to DMCNN paper:

An argument is correctly classifiedd if its event subtype, offsets and argument role match those of any of the reference argument mentions

for item, item_ in zip(arguments, arguments_): 

The above code in this repo does not match that idea, so I replaced that line with:

ct += len(set(arguments) & set(arguments_))  # count any argument in golden
# for item, item_ in zip(arguments, arguments_):
#     if item[2] == item_[2]:
#         ct += 1

@airkid
Author

airkid commented Mar 7, 2019

Hi @mikelkl, I believe this is a correct way to calculate the F1 score for this task.
Have you reproduced the experiment? I can only reach an F1 score below 0.4 on the test data.

@mikelkl

mikelkl commented Mar 7, 2019

Hi @airkid, I got a slightly higher result, but it's on my own randomly split test set, so I have no idea whether it reliably reflects the paper's result.

@airkid
Author

airkid commented Mar 7, 2019

Hi @mikelkl, can you try the data split updated by the author?
My result is still far from the paper's.

@mikelkl

mikelkl commented Mar 11, 2019

Hi @airkid, I'm afraid I cannot do that because I don't have the ACE2005 English data.

@carrie0307

Hi @airkid, would you please tell me the result you got? I only got F1 = 0.64 on trigger classification.

@rhythmswing

https://github.com/lx865712528/JMEE/blob/494451d5852ba724d273ee6f97602c60a5517446/enet/testing.py#L72
If I add the following assertion just before this line:
assert len(arguments) == len(arguments_)
it raises an AssertionError.
I believe this is because arguments contains the gold arguments while arguments_ contains only the predicted arguments, whose length changes dynamically during training.

Hi,

If you've tried their code, would you tell me your reproduced results on trigger detection and argument detection?
