Hello,
I'm currently in the midst of replicating the results for the relation extraction task on the ChemProt dataset using SciBERT, but so far I have been unable to reach the F1 score reported in your paper. Using the hyperparameters described in the paper, the script provided in your codebase gives me an F1 score of 0.51 on the test set. Please advise whether any further hyperparameter tuning or other tweaking is required to reproduce the reported F1 score.
Thanks in advance!
pg427 changed the title from "Reproducability issue with ChemProt results for relation extraction with SciBERT" to "Reproducability issue with ChemProt results for relation extraction using SciBERT" on Apr 8, 2021.
Sorry for missing this. I believe the issue you're seeing is a metric mismatch. For the ChemProt result, the standard metric is micro-F1 (which, for single-label multi-class classification, is computationally equivalent to accuracy), not macro-F1. In our experiments, our macro-F1 was also around 0.5, while micro-F1 is the number reported in the paper. We note this in the Table 1 caption:
Keeping with past work, we report macro F1 scores for NER (span-level), macro F1 scores for REL and CLS (sentence-level), macro F1 for PICO (token-level), and micro F1 for ChemProt specifically.
See this screenshot of some experimental results (right-most column is the macro-F1 result you're seeing):
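To make the metric mismatch concrete, here is a small pure-Python sketch (the relation labels and predictions below are made up for illustration, not taken from the actual SciBERT runs). It shows that when every example has exactly one label, pooled micro-F1 reduces to plain accuracy, while macro-F1 averages per-class F1 scores equally and drops when rarer classes are predicted poorly:

```python
def per_class_counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_scores(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, lab) for lab in labels]

    # Micro-F1: pool TP/FP/FN across all classes, then compute one F1.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

    # Macro-F1: compute F1 per class, then take the unweighted mean.
    per_class = [
        2 * t / (2 * t + f_p + f_n) if (2 * t + f_p + f_n) else 0.0
        for t, f_p, f_n in counts
    ]
    macro = sum(per_class) / len(per_class)
    return micro, macro

# Toy single-label predictions over ChemProt-style relation classes.
y_true = ["CPR:3", "CPR:4", "CPR:4", "CPR:9", "false"]
y_pred = ["CPR:3", "CPR:4", "false", "CPR:9", "false"]

micro, macro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"micro-F1 = {micro:.3f}, accuracy = {accuracy:.3f}, macro-F1 = {macro:.3f}")
```

In the single-label case every misclassification is simultaneously one false positive (for the predicted class) and one false negative (for the true class), so the pooled precision, recall, and F1 all collapse to accuracy; this is why the paper's micro-F1 number can sit well above a run's macro-F1.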