Is there a way to restrict the areas of a page that are read? #193

Open
tpanza opened this issue Jun 14, 2024 · 4 comments

Comments

tpanza commented Jun 14, 2024

I am having to process a large PDF document. It has some logos, boilerplate text, and other useless text in the top, bottom, left, and right margins of every page.

The model(s) seem to be struggling with recognizing these and turning them into Markdown text. It results in random gobbledygook at every PDF page boundary in the resulting Markdown file.

Might there be a way to pass in some settings so that these margin areas are ignored? I took a look at settings.py (https://github.com/VikParuchuri/marker/blob/master/marker/settings.py) but didn't see anything about that.

I see on the README, "Removes headers/footers/other artifacts", but how do I control/tweak that?
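
To make the request concrete, here is a minimal sketch of the kind of margin filter being asked for. It does not use marker's API: the `Margins` class, the `filter_blocks` helper, and the `(bbox, text)` block layout are all hypothetical names invented for illustration, and coordinates are assumed to be in PDF points with the origin at the top-left of the page.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x0, y0, x1, y1) in points, origin assumed at the top-left corner
BBox = Tuple[float, float, float, float]
Block = Tuple[BBox, str]


@dataclass
class Margins:
    """Hypothetical per-page margins to ignore, in points."""
    top: float
    bottom: float
    left: float
    right: float


def content_area(page_width: float, page_height: float, m: Margins) -> BBox:
    """Return the rectangle that remains after trimming the margins."""
    return (m.left, m.top, page_width - m.right, page_height - m.bottom)


def keep_block(bbox: BBox, area: BBox) -> bool:
    """Keep a block only if its center point lies inside the content area."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    return area[0] <= cx <= area[2] and area[1] <= cy <= area[3]


def filter_blocks(blocks: List[Block], page_width: float,
                  page_height: float, margins: Margins) -> List[Block]:
    """Drop every block whose center falls in one of the margins."""
    area = content_area(page_width, page_height, margins)
    return [block for block in blocks if keep_block(block[0], area)]


if __name__ == "__main__":
    # US Letter page (612 x 792 pt) with one-inch (72 pt) margins
    page_blocks = [
        ((72, 20, 540, 50), "Company logo text"),
        ((72, 100, 540, 700), "Body paragraph"),
        ((72, 750, 540, 780), "Page footer"),
    ]
    for _, text in filter_blocks(page_blocks, 612, 792, Margins(72, 72, 72, 72)):
        print(text)
```

Testing the block's center rather than requiring full containment is a deliberate choice here, so that body text that slightly overlaps a margin is not lost. Whether marker exposes block bounding boxes at a point where such a filter could be applied is an open question in this thread.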

luc42ei commented Jun 21, 2024

Yep, that would be great to have. In my case, I'd like to EXPAND THE AREA so that certain headers/footers are actually included: right now they are omitted even though they contain important headings (e.g. of tables).

luc42ei commented Jun 22, 2024

Actually, one can expand the area by changing the BAD_SPAN_TYPES parameter in the settings.py file. It seems that removing all elements from it would expand the area to 100%.
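
As a rough illustration of why emptying such a list would keep everything, here is a sketch of label-based exclusion. It is not marker's code: `drop_excluded`, the dict-based block layout, and the label strings are hypothetical. Only the name BAD_SPAN_TYPES comes from the comment above, and its actual contents should be checked in settings.py.

```python
from typing import Dict, Iterable, List


def drop_excluded(blocks: List[Dict[str, str]],
                  excluded_labels: Iterable[str]) -> List[Dict[str, str]]:
    """Return the blocks whose layout label is not in the exclusion list."""
    excluded = set(excluded_labels)
    return [block for block in blocks if block["label"] not in excluded]


if __name__ == "__main__":
    # Hypothetical layout labels, chosen only for this example
    page = [
        {"label": "Page-header", "text": "Table 3: Results (continued)"},
        {"label": "Text", "text": "Body paragraph"},
        {"label": "Page-footer", "text": "Confidential"},
    ]
    # With labels listed, headers and footers are removed
    print(drop_excluded(page, ["Page-header", "Page-footer"]))
    # With an empty list, nothing is removed
    print(drop_excluded(page, []))
```

Under this model, removing a single label from the list (for example only the header label) would restore just that kind of block rather than the whole page.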

svenha commented Dec 2, 2024

@luc42ei In version 1, there is no BAD_SPAN_TYPES anymore. What are you using now?

@Nevermetyou65

Coming here with the same thought. I see "Removes headers/footers/other artifacts" in the README, but cannot find a way to control it.
