Duplicate words in OCR result #330

jonathanMindee · 2021-06-26T10:37:56Z

🐛 Bug

Running the sample code:

from doctr.documents import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# Load the image
doc = DocumentFile.from_images(["table.png"])
# Analyze
result = model(doc)

result.show(doc)

I get this result:

Everything looks fine, but some words overlap. The mouse is pointing at the word "Header4", and there is another word with the content "4". Because of this, I cannot properly reconstruct the table header: there is an extra "4".

To Reproduce

Steps to reproduce the behavior:

Download this image

Run the following code:

from doctr.documents import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# Load the image
doc = DocumentFile.from_images(["table.png"])
# Analyze
result = model(doc)

result.show(doc)

jonathanMindee · 2021-06-26T10:42:58Z

I think some overlap-detection post-processing could filter out those duplicates.

fg-mindee · 2021-06-26T12:03:53Z

Thanks for reporting this!

I'm not sure which way would be the best, but here are some ideas to handle this:

Batch post-processing: perform NMS with a looser threshold.
Manual post-processing: estimate candidate overlaps with a box IoU. For pairs whose text overlaps as well, we perform a manual NMS (keeping the one with the longest string, provided its confidence is above a given threshold). The likely issue is that the resulting string will wrongly omit the blank space.
Training-based: we add the blank space to the vocab of the recognition model and use NMS.

Since the first option is natively implemented in most modern DL frameworks, it might be a suitable one to try first.
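As a rough illustration of the manual post-processing idea above, here is a minimal sketch. It is not docTR code: the `Word` tuple, the `(xmin, ymin, xmax, ymax)` box format, the substring test used as "text overlap", and both threshold values are assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple

# Assumed box format for this sketch: (xmin, ymin, xmax, ymax) in relative coordinates
Box = Tuple[float, float, float, float]


class Word(NamedTuple):
    """Hypothetical container for one recognized word."""
    text: str
    confidence: float
    box: Box


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def drop_text_duplicates(
    words: List[Word], iou_thresh: float = 0.1, min_conf: float = 0.5
) -> List[Word]:
    """Drop a word when a longer, confident, overlapping word already contains its text."""
    dropped = set()
    for i, shorter in enumerate(words):
        for j, longer in enumerate(words):
            if i == j or j in dropped:
                continue
            if (
                len(longer.text) > len(shorter.text)
                and shorter.text in longer.text  # "text overlap", approximated by a substring test
                and longer.confidence >= min_conf
                and box_iou(shorter.box, longer.box) >= iou_thresh
            ):
                dropped.add(i)
                break
    return [w for k, w in enumerate(words) if k not in dropped]


# Made-up coordinates mimicking the "Header4" / "4" case from the report
words = [
    Word("Header4", 0.98, (0.10, 0.10, 0.45, 0.20)),
    Word("4", 0.80, (0.40, 0.10, 0.45, 0.20)),
    Word("Header5", 0.97, (0.55, 0.10, 0.90, 0.20)),
]
kept = drop_text_duplicates(words)
print([w.text for w in kept])  # ['Header4', 'Header5']
```

Note that the IoU threshold has to be low here: a small box sitting inside a much larger one has a small IoU even though it is fully covered.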

charlesmindee · 2021-06-28T08:11:28Z

I think we shouldn't only perform NMS, because here for instance we want to keep both boxes when there is an overlap. I see 2 solutions:

Merging the 2 boxes into 1 box: it is quick and easy, but it can include undesirable spaces.
Arbitrarily shorten one of the 2 boxes to eliminate overlapping.

It is, however, an uncommon edge case; I think it only happens with underscores.
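The second solution (shortening boxes) could look like the sketch below. It assumes both boxes sit on the same text line and use an `(xmin, ymin, xmax, ymax)` format; the function name and the choice of cutting at the middle of the overlap are hypothetical, not something decided in this thread.

```python
from typing import Tuple

# Assumed box format for this sketch: (xmin, ymin, xmax, ymax)
Box = Tuple[float, float, float, float]


def split_horizontal_overlap(a: Box, b: Box) -> Tuple[Box, Box]:
    """Shrink two horizontally overlapping boxes so they meet in the middle of the overlap.

    Returns the (left, right) boxes. Boxes are returned unchanged when they
    do not overlap or when one horizontally contains the other.
    """
    left, right = (a, b) if a[0] <= b[0] else (b, a)
    overlap = left[2] - right[0]
    if overlap <= 0 or right[2] <= left[2]:
        return left, right
    cut = right[0] + overlap / 2
    return (left[0], left[1], cut, left[3]), (cut, right[1], right[2], right[3])


left, right = split_horizontal_overlap(
    (0.10, 0.10, 0.40, 0.20),
    (0.30, 0.10, 0.60, 0.20),
)
print(left, right)
```

The containment case is deliberately left untouched: trimming a box that lies entirely inside another one would shrink it to nothing, which is a filtering decision rather than a shortening one.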

charlesmindee · 2021-06-28T10:12:38Z

As a matter of fact, we do want to suppress very small boxes included in other ones, so I suggest the following:

Performing NMS with a very high threshold (let's say > 80%) to filter out boxes covered by other ones (avoiding repetitions without losing information).
Merging boxes with a significant overlap but a lower IoU (for instance, an IoU between 20% and 80%), to keep all the information we need.

This overlap seems to occur mostly with underscores, so I think it is a good approximation to merge boxes in that case (technically, it is the same word). What do you think @fg-mindee ?
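A minimal sketch of this two-threshold scheme, assuming `(xmin, ymin, xmax, ymax)` boxes and a single greedy pass in input order. The function names are hypothetical, and this is not necessarily how the `filter_boxes` method referenced in this issue works.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (xmin, ymin, xmax, ymax)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(a: Box, b: Box) -> Box:
    """Smallest box enclosing both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def resolve_overlaps(
    boxes: List[Box], merge_thresh: float = 0.2, suppress_thresh: float = 0.8
) -> List[Box]:
    """Drop near-duplicates (IoU > suppress_thresh), merge partial overlaps."""
    kept: List[Box] = []
    for box in boxes:
        absorbed = False
        for idx, ref in enumerate(kept):
            iou = box_iou(box, ref)
            if iou > suppress_thresh:
                absorbed = True  # near-duplicate of a kept box: drop it
                break
            if iou >= merge_thresh:
                kept[idx] = merge_boxes(box, ref)  # partial overlap: merge into the kept box
                absorbed = True
                break
        if not absorbed:
            kept.append(box)
    return kept


boxes = [
    (0.10, 0.10, 0.40, 0.20),  # word A
    (0.11, 0.10, 0.40, 0.20),  # near-duplicate of A (IoU ~0.97): dropped
    (0.25, 0.10, 0.55, 0.20),  # partial overlap with A (IoU ~0.33): merged
    (0.70, 0.10, 0.90, 0.20),  # isolated word: kept as-is
]
resolved = resolve_overlaps(boxes)
print(resolved)  # [(0.1, 0.1, 0.55, 0.2), (0.7, 0.1, 0.9, 0.2)]
```

One limitation of this sketch: a small box fully contained in a much larger one has a low IoU, so it can fall below both thresholds and be left untouched.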

fg-mindee · 2021-06-28T17:14:06Z

@charlesmindee Thanks for the suggestion!
However, when I suggested an NMS, I was thinking about the iterative merging implementation of it.
So I fully agree that pure filtering won't be enough. As you mentioned, we might need to use another metric than IoU 👍

fg-mindee · 2021-12-10T13:31:12Z

Coming back to this issue, I suggest the following:

Investigate the heatmap of the text detection module to assess whether this comes from the segmentation or the box conversion part (I'm especially interested in the overlapping localization candidates shown in the issue description image).
Discuss options to handle the situation depending on our findings.
As shown earlier, NMS isn't really the best option here since we're talking about small IoU overlaps. If we tweak this NMS, it will start merging words that are correctly separated by a blank space.

But let's not leave this issue unaddressed 😃

felixT2K · 2023-07-25T06:20:22Z

@frgfm @charlesmindee @odulcy-mindee

Seems to be solved with preserve_aspect_ratio=True.
(TF and PT behave identically.)
I have tested some personal documents, and keeping the aspect ratio was always the better choice... Should we use it by default, wdyt?

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True, preserve_aspect_ratio=True)
# Load the image
doc = DocumentFile.from_images(["/home/felix/Desktop/table.png"])
# Analyze
result = model(doc)

result.show(doc)

charlesmindee · 2023-07-31T12:43:43Z

Hi @felixdittrich92, thanks for the suggestion. I think we can change the default behaviour, since it is quite natural to preserve the aspect ratio by default. Moreover, it will make the predictions more robust to cropping.

jonathanMindee added the type: bug Something isn't working label Jun 26, 2021

charlesmindee self-assigned this Jun 27, 2021

charlesmindee mentioned this issue Jun 28, 2021

feat: add filter_boxes method #332

Merged

charlesmindee added the help wanted Extra attention is needed label Jul 2, 2021

felixdittrich92 mentioned this issue Apr 21, 2022

Crop intersecting bounding boxes to improve precision #895

Closed

felixdittrich92 added this to the 0.7.0 milestone Sep 26, 2022

felixdittrich92 mentioned this issue Sep 26, 2022

Release tracker - v0.9.0 #1074

Closed

6 tasks

felixdittrich92 linked a pull request Aug 10, 2023 that will close this issue

[predictor] aspect ratio true by default #1279

Merged

felixdittrich92 closed this as completed in #1279 Aug 11, 2023
