I hope to find a way to remove headers and footers #433

Open
bjtangseng opened this issue Dec 19, 2024 · 2 comments

Comments

@bjtangseng

I have been using the marker project and find it very good. I am not sure whether the following is a problem with how I use it or a detail I have overlooked.
I would like a way to exclude headers and footers when converting PDFs, because the content in those areas is generally irrelevant: logos or boilerplate text. Could a parameter be added so that this unneeded information does not interfere with the conversion output?

Thank you.

@VikParuchuri
Owner

Can you please share an example PDF?

@bjtangseng
Author

Thank you very much for your reply.
I am attaching a sample file. It is a PDF that is publicly available in China, so there are no confidentiality concerns.
On the first page, the header contains a logo and the address of the organization that wrote the document. From the second page onward, there are smaller headers with logos. Some files also have footers, mainly containing information such as an introduction to the organization and a disclaimer.

I would like a parameter that skips this information. Surya can analyze the layout and reports clear header and footer regions, so could those regions be treated as exclusions, with no recognition or further processing performed on them?

Thank you

fileView.pdf
