Few Concerns on the Markdown Generation - Overlapping image/table/text boxes and Different output while using Surya #148

Open
Curiosity007 opened this issue May 29, 2024 · 1 comment


Curiosity007 commented May 29, 2024

  1. Is it possible to avoid creating overlapping bboxes? That would make it much easier to identify the elements. One example is below:

image

  2. The image above was produced with the Surya Streamlit app. But when I run the same PDF through marker, the extracted image is very different and is actually truncated. Are the marker and Surya repos in sync? (I think the Surya repo is using a more recent layout model than marker.)

  3. Is it possible to expose the bbox tolerance as a configurable argument, so that it can capture a little more of the surrounding area when image detection is off and only a cropped image is extracted?

  4. Is it possible to extract tables as images rather than printing them directly into the md file, at least as a configurable option? Table detection is not yet reliable.

  5. How can I ensure that text inside a table or image is not repeated in the md file?
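
To make the bbox tolerance and repeated-text questions concrete, here is a rough sketch of the behaviour being asked for. It is plain Python and does not use marker's or Surya's actual API. The names `expand_bbox`, `covered_fraction`, `drop_covered_text`, the `tolerance` argument, and the `0.8` threshold are all hypothetical, and boxes are assumed to be `(x0, y0, x1, y1)` tuples with a top-left origin.

```python
from typing import List, Tuple

# Assumed box layout: (x0, y0, x1, y1), origin at the top-left of the page.
BBox = Tuple[float, float, float, float]


def expand_bbox(bbox: BBox, tolerance: float, page_w: float, page_h: float) -> BBox:
    """Grow a box by `tolerance` on every side, clamped to the page bounds."""
    x0, y0, x1, y1 = bbox
    return (
        max(0.0, x0 - tolerance),
        max(0.0, y0 - tolerance),
        min(page_w, x1 + tolerance),
        min(page_h, y1 + tolerance),
    )


def area(box: BBox) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: BBox, b: BBox) -> float:
    """Area shared by two boxes (zero if they do not overlap)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def covered_fraction(inner: BBox, outer: BBox) -> float:
    """Fraction of `inner` that lies inside `outer`."""
    inner_area = area(inner)
    return intersection_area(inner, outer) / inner_area if inner_area > 0 else 0.0


def drop_covered_text(
    text_boxes: List[BBox], region_boxes: List[BBox], threshold: float = 0.8
) -> List[BBox]:
    """Keep only text boxes that are not mostly inside any table/image region."""
    return [
        t
        for t in text_boxes
        if all(covered_fraction(t, r) < threshold for r in region_boxes)
    ]
```

With something like this, a detected image box could be padded before cropping, and any text line that falls mostly inside a table or image region could be skipped when writing the md file. The same `covered_fraction` check could also be used to flag overlapping layout boxes.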


homeant commented May 30, 2024

layout_0
I had the same problem
