Issue with two column format #3

werkstattcodes · 2024-12-27T16:36:54Z

Hello,

Many thanks for all the effort you have put into this project.

I am currently working on session transcripts that are laid out in a two-column format. I had previously tried Tesseract (via the R packages tesseract and pdf_tools), but the results were not entirely satisfying.

I have now tried your approach, but to my surprise it recognized only the header line of the page. Any suggestions as to what I am missing?
I have attached a sample page of the records. If your time permits, any help or suggestions would be very welcome.

Many thanks

imfname_158653_page4.pdf

gabriel-piles · 2024-12-28T10:32:00Z

Thank you for reaching out.

The issue with the PDF is that some of its text is marked as hidden, so that text is missing from the XML produced by Poppler's PDF-to-XML tool.

Furthermore, because the two-column layout is not detected properly, the extracted text may still be hard to follow.
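To check whether hidden text is really what is missing, one option is to run Poppler's converter by hand with hidden-text extraction switched on. The sketch below assumes that `pdftohtml` from Poppler is installed locally; `dump_hidden_xml` is a hypothetical helper name, and the exact options the project passes to Poppler internally may differ.

```shell
#!/bin/sh
# Sketch: dump a PDF to Poppler XML, including text flagged as hidden.
# dump_hidden_xml is a hypothetical helper, not part of the repository.
dump_hidden_xml() {
    pdf="$1"          # input PDF
    out_prefix="$2"   # Poppler writes <out_prefix>.xml

    if [ ! -f "$pdf" ]; then
        echo "No such file: $pdf" >&2
        return 1
    fi
    if ! command -v pdftohtml >/dev/null 2>&1; then
        echo "pdftohtml (Poppler) is not installed" >&2
        return 2
    fi

    # -xml: XML output; -i: ignore images; -hidden: also extract hidden text
    pdftohtml -xml -i -hidden "$pdf" "$out_prefix"
}

# Example (paths are placeholders):
#   dump_hidden_xml /PATH/TO/PDF/pdf_name.pdf /tmp/pdf_name
```

Comparing the XML produced with and without `-hidden` should show whether the body text of the page only appears when hidden text is included.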

If you want to make it work, clone this repository:

git clone https://github.com/huridocs/pdf-document-layout-analysis
cd pdf-document-layout-analysis

Move to the following branch:

git checkout xml-hidden-by-default

And then execute the text extraction:

make start
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060
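If several pages need to be processed, the request above can be wrapped in a small shell function. This is only a sketch: `extract_layout` and the default output file name are hypothetical, and it assumes the service started by `make start` is listening on `localhost:5060` as in the command above.

```shell
#!/bin/sh
# Sketch: send one PDF to the locally running service and save the response.
# extract_layout and layout_output.txt are hypothetical names.
extract_layout() {
    pdf="$1"                          # PDF to analyze
    out="${2:-layout_output.txt}"     # where to store the response
    host="${3:-localhost:5060}"       # service address, as in the thread

    if [ ! -f "$pdf" ]; then
        echo "No such file: $pdf" >&2
        return 1
    fi

    # --fail makes curl exit non-zero on HTTP errors instead of saving an error page
    curl --fail --silent --show-error -X POST -F "file=@${pdf}" -o "$out" "$host"
}

# Example (path is a placeholder):
#   extract_layout /PATH/TO/PDF/pdf_name.pdf pdf_name_output.txt
```

Saving the response to a file makes it easier to inspect whether more than the header line was recognized.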

Let us know how it went.
Best.
