Issue with two column format #3

Open
werkstattcodes opened this issue Dec 27, 2024 · 1 comment

Comments

@werkstattcodes

Hello,

Many thanks for all the effort you have put into this project.

I am currently working on session transcripts that are in a two-column format. I previously tried Tesseract (via the R packages tesseract and pdf_tools), but the results were not entirely satisfactory.

I have now tried your approach but, to my surprise, it recognized only the header line of the page. Any suggestions as to what I am missing?
I have attached a sample page of the records. If your time permits, any help or suggestions would be very welcome.

Many thanks

imfname_158653_page4.pdf

@gabriel-piles
Member

Thank you for reaching out.

The issue with the PDF is that some text is marked as hidden, making it unavailable in the XML format after conversion using Poppler's PDF-to-XML tool.

Furthermore, because the two-column layout is not properly detected, the extracted text may come out in the wrong reading order and be difficult to follow.
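If it helps to confirm the hidden-text diagnosis on your own machine, here is a rough sketch that converts a PDF with Poppler's pdftohtml twice, once without and once with its -hidden option, and counts the text elements in each XML file. This is only an illustration and not necessarily how the project calls Poppler internally; the function names and the output base names (visible, with_hidden) are made up for this example.

```shell
#!/bin/sh
# Count the <text ...> elements in a Poppler XML file.
count_text_nodes() {
  grep -o '<text ' "$1" | wc -l | tr -d ' '
}

# Convert the PDF twice and print both counts.
# Assumption: pdftohtml (Poppler utilities) is installed and on PATH.
# "visible" and "with_hidden" are arbitrary output base names.
compare_hidden() {
  pdf="$1"
  pdftohtml -xml -i "$pdf" visible >/dev/null
  pdftohtml -xml -i -hidden "$pdf" with_hidden >/dev/null
  echo "without -hidden: $(count_text_nodes visible.xml)"
  echo "with -hidden:    $(count_text_nodes with_hidden.xml)"
}
```

Calling compare_hidden imfname_158653_page4.pdf should print two counts. A much larger number on the second line would suggest that most of the page text is marked as hidden.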

If you want to make it work, clone this repository:

git clone https://github.com/huridocs/pdf-document-layout-analysis
cd pdf-document-layout-analysis

Move to the following branch:

git checkout xml-hidden-by-default

And then execute the text extraction:

make start
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060
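For repeated runs, the curl call can be wrapped in a small helper that checks the file first and saves the response. This is a minimal sketch: the function name extract_layout and the default output file response.out are hypothetical, and it assumes the service started by make start is listening on localhost:5060 as in the command above.

```shell
#!/bin/sh
# Hypothetical helper around the curl call above.
# $1: path to the PDF; $2: optional output file (default: response.out).
extract_layout() {
  pdf="$1"
  out="${2:-response.out}"
  # Fail early with a clear message instead of a curl error.
  if [ ! -f "$pdf" ]; then
    echo "no such file: $pdf" >&2
    return 1
  fi
  # --fail makes curl return non-zero on HTTP error responses.
  curl --fail --silent --show-error -X POST \
    -F "file=@${pdf}" localhost:5060 -o "$out"
}
```

For example, extract_layout /PATH/TO/PDF/pdf_name.pdf writes the service response to response.out in the current directory.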

Let us know how it goes.
Best.
