Hi!
Thanks for this neat tool. I've found it quite useful. However, I noticed that the micro and macro average F1 scores seem to be calculated incorrectly in the evaluation script and might be giving overly optimistic numbers:
micro_f1 = f1_score(correct_output.reshape(-1).numpy(), predictions.reshape(-1).numpy(), average='micro')
Jump to code: https://github.com/AndriyMulyar/bert_document_classification/blob/060e9034a8c41bfb34b8762c8e1612321015c076/bert_document_classification/document_bert.py#L265
f1_score supports "1d array-like, or label indicator array / sparse matrix". As I understand it, the code above flattens the matrix as if this were a multi-class setting, whereas the original matrix should be used in a multi-label setting, just transposed:
micro_f1 = f1_score(correct_output.T.numpy(), predictions.T.numpy(), average='micro')
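As a minimal illustration of why the two calls diverge (the toy matrices below are made up for this comment, not taken from the repository): with flattened 1d inputs, micro-F1 collapses to plain accuracy over every cell, so the many correctly predicted zeros inflate the score, while the label-indicator form aggregates TP/FP/FN over positive labels only.
import numpy as np
from sklearn.metrics import f1_score

# 4 documents x 5 labels, one true label per document (sparse, as is typical).
y_true = np.array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 0, 0],    # correct
                   [0, 0, 0, 0, 0],    # misses the positive label
                   [0, 0, 0, 0, 1],    # predicts the wrong label
                   [0, 0, 0, 0, 0]])   # misses the positive label

# Flattened: every cell becomes one binary "sample", so micro-F1 equals
# accuracy over all 20 cells, most of which are easy zeros.
print(f1_score(y_true.reshape(-1), y_pred.reshape(-1), average='micro'))  # 0.80

# Label-indicator matrix: TP/FP/FN are counted per label and aggregated;
# true negatives are ignored.
print(f1_score(y_true, y_pred, average='micro'))  # ~0.33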
With the data I'm testing on, I got a micro F1 of 0.88 with your version and 0.35 with mine. I was able to verify that the latter is correct with other software.
You may want to check the numbers in the paper (https://arxiv.org/pdf/1910.13664.pdf) as well.
Kind regards,
Samuel
Samuel,
Thank you for pointing this out. In the original datasets considered for evaluation, performance was calculated in the multi-class manner. I will be pushing out an update that addresses this (giving the option between the two), alongside a fix for another implementation bug in this public code release.
Thanks!
Andriy
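For illustration only, a hypothetical sketch of what an "option between the two" evaluation modes could look like (the function name, signature, and defaults are invented here and are not taken from the promised update):
import numpy as np
from sklearn.metrics import f1_score

def evaluation_f1(y_true, y_pred, multilabel=True, average='micro'):
    """Score a (n_samples, n_labels) binary indicator matrix either as a
    multi-label problem (true negatives ignored) or in the original
    flattened multi-class style. Torch tensors from the repository would
    need .numpy() (and possibly a transpose) before being passed in."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if multilabel:
        return f1_score(y_true, y_pred, average=average)
    # Legacy behaviour: treat every cell as an independent binary decision.
    return f1_score(y_true.reshape(-1), y_pred.reshape(-1), average=average)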