
About SROIE annotations #21

Closed
furkanpala opened this issue Jan 19, 2023 · 10 comments

Comments

@furkanpala

Hi,

You mentioned that you annotated the SROIE dataset so that it could be used effectively with ViBERTgrid. While annotating, what did you do with tokens that occur multiple times? For example, for the date label, there are receipts in which the same date appears more than once. Did you annotate all of the occurrences as date, or only one? Thanks.

@Ajithbalakrishnan commented Jan 19, 2023

@furkanpala were you able to achieve good performance on any standard dataset?

@furkanpala (Author)

Yes, I got good results on the SROIE dataset and on my in-house datasets. However, I used the matching strategy mentioned in the README, so there is a 3-5% drop in scores.
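For context, a matching strategy of that kind could look roughly like the sketch below, which assigns labels to OCR words by normalized substring matching against the ground-truth field values. This is a hypothetical illustration; the function names and the normalization rule are assumptions, not the repository's actual sroie_data_preprocessing.py logic.

```python
# Hypothetical sketch of a label-matching step; names are illustrative,
# not the repository's API.

def normalize(s: str) -> str:
    """Lowercase and keep only alphanumerics before comparison."""
    return "".join(ch for ch in s.lower() if ch.isalnum())

def match_entities_to_words(ocr_words, entities):
    """Label each OCR word whose normalized text appears inside the
    normalized ground-truth value; everything else stays background ("O")."""
    labels = ["O"] * len(ocr_words)
    for label, value in entities.items():  # e.g. {"date": "19/01/2023", ...}
        target = normalize(value)
        for i, word in enumerate(ocr_words):
            token = normalize(word)
            if token and token in target:
                labels[i] = label
    return labels

words = ["Date:", "19/01/2023", "TOTAL", "$12.50", "Thank", "you"]
truth = {"date": "19/01/2023", "total": "$12.50"}
print(match_entities_to_words(words, truth))
# ['O', 'date', 'O', 'total', 'O', 'O']
```

Naive substring matching like this can mislabel repeated or short tokens, which is one plausible source of the score drop discussed in this thread.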

@Ajithbalakrishnan commented Jan 22, 2023

Can you share your SROIE dataset and config file?
I guess there is some issue with the dataset that I am trying to train on.
I am not able to replicate the SROIE results reported in the paper; mine are far worse. #20 (comment)
You can reach me at [email protected]

@ZeningLin (Owner)

> You mentioned that you annotated the SROIE dataset so that it could be used effectively with ViBERTgrid. While annotating, what did you do with tokens that occur multiple times? For example, for the date label, there are receipts in which the same date appears more than once. Did you annotate all of the occurrences as date, or only one? Thanks.

This is a problem worth noting. We found that there can be multiple occurrences of the "date" and "total" entities in a single receipt. For the "date" entities, we annotate the first matching string in reading order as key information, while the others are annotated as background. For the "total" entities, we pick the most likely one, which usually has a larger/bold font and a key containing the string "total" nearby.
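In code, the first-match-in-reading-order rule for dates could be sketched as follows (an illustrative snippet with made-up names, not the actual annotation tooling):

```python
# Illustrative sketch of the first-match-in-reading-order rule described
# above; function and variable names are assumptions.

def label_first_occurrence(words_in_reading_order, value, label):
    """Label only the first word matching `value` as key information;
    later duplicates remain background ("O")."""
    labels = ["O"] * len(words_in_reading_order)
    for i, word in enumerate(words_in_reading_order):
        if word == value:
            labels[i] = label  # first hit in reading order wins
            break              # later occurrences stay background
    return labels

words = ["Invoice", "25/01/2018", "...", "Printed:", "25/01/2018"]
print(label_first_occurrence(words, "25/01/2018", "date"))
# ['O', 'date', 'O', 'O', 'O']
```

The "total" case needs extra heuristics (font size, a nearby "total" key), which a simple positional rule like this does not capture.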

@furkanpala (Author)

> Can you share your SROIE dataset and config file? I guess there is some issue with the dataset that I am trying to train on. I am not able to replicate the SROIE results reported in the paper; mine are far worse. #20 (comment) You can reach me at [email protected]

Unfortunately, I am not able to share the dataset right now.

@furkanpala (Author)

Hi again, I would like to follow up on the SROIE annotations, if you do not mind. I wonder whether you used the preprocessed data when labelling the words, or whether you OCRed the images from scratch. More precisely, we have the boxes, keys, and images from the original SROIE. Then we can either use your sroie_data_preprocessing.py script to obtain word-level bounding boxes and annotate those preprocessed words, or we can explicitly OCR the images to obtain word-level bounding boxes. It would be nice if you could clarify. I am asking because I preprocessed the original dataset using your script and then annotated the resulting words, which ended up giving bad performance (F1 score around 86%).

@ZeningLin (Owner)

We used an in-house OCR engine to label the dataset from scratch.

@furkanpala (Author) commented Feb 14, 2023

Sorry for the follow-ups. In the README file, you mentioned that matching the key information fields with the OCR output causes a 3~5 point decrease in the final F1 score. By final F1 score, do you mean token-level F1 or entity-level (official SROIE evaluation) F1? The reason I am asking is that I obtained an entity-level F1 of ~50% when I trained and evaluated using the matching strategy. Then, when I trained and evaluated on the re-labeled dataset (worth mentioning that the re-labeling was done on the output of the matching script, i.e., not OCRed from scratch), I obtained an entity-level F1 of ~86%. That seems like much more than a 3~5 point decrease in entity-level F1 compared to the result in the original paper, which is 96.25%. Am I missing something, or did you mean token-level F1 by the 3~5 point decrease? Even a 3~5 point decrease in token-level F1 is enough to make a huge difference in entity-level F1, in my opinion. Thanks a lot for your time...
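To make the token-level vs. entity-level distinction concrete, here is a toy sketch (not the official SROIE scorer) showing how a single wrong token fails an otherwise correct entity:

```python
# Toy numbers showing how a small token-level gap can become a much
# larger entity-level gap: one wrong token fails the whole entity.

def token_f1(pred, true):
    """Micro F1 over individual tokens."""
    tp = len(set(pred) & set(true))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

def entity_correct(pred, true):
    """Entity-level scoring: the whole field must match exactly."""
    return pred == true

pred = ["ABC", "PTE", "LTD."]  # one token differs ("LTD." vs "LTD")
true = ["ABC", "PTE", "LTD"]

print(f"token F1: {token_f1(pred, true):.2f}")          # 0.67
print(f"entity correct: {entity_correct(pred, true)}")  # False
```

Under entity-level scoring, every token of a field must be right, so small token-level errors compound across multi-token entities.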

@ZeningLin (Owner)

Hello, sorry for my late reply.

I re-launched the experiments using the dataset pre-processed by the matching strategy, and I got a token-level F1 of 97 and an entity-level F1 of 60. The "3~5 point decrease" result came from earlier experiments using code from a previous version, and it seems that the metric computation strategy was wrong in the old code. I will change the description in the README and provide model weights trained on my re-labelled data (which reach an entity-level F1 of 96+).

@furkanpala (Author)

Hi,

Thanks for the reply.
