Many thanks for all the effort you have put into this project.
I am currently working on session transcripts that are in a two-column format. I previously tried Tesseract (via the R packages tesseract and pdftools), but the results were not fully satisfactory.
I have now tried your approach, but to my surprise it only recognized the header line of the page. Any suggestions as to what I am missing?
I am attaching a sample page of the records. If your time permits, any help or suggestion would be very welcome.
The issue with the PDF is that some of its text is marked as hidden, so that text is missing from the XML produced by Poppler's PDF-to-XML conversion.
Furthermore, because the two-column layout is not properly detected, the extracted text may be difficult to follow.
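If you want to check the hidden-text explanation on your own file, a rough sketch like the one below may help. It assumes Poppler's pdftohtml is installed and compares how many text elements the XML contains with and without the -hidden flag. The function name count_text_elements is made up for this example, and whether the project invokes Poppler with exactly these options is an assumption on my part.

```shell
#!/bin/sh
# Sketch: count <text> elements in Poppler's XML output with and without
# hidden text. count_text_elements is a hypothetical helper name.
count_text_elements() {
  pdf_path="$1"

  # Stop early if the file is missing.
  if [ ! -f "$pdf_path" ]; then
    echo "PDF not found: $pdf_path" >&2
    return 1
  fi

  # -xml: XML output, -i: ignore images, -stdout: write to standard output.
  visible=$(pdftohtml -xml -i -stdout "$pdf_path" | grep -c '<text ')

  # -hidden: also extract text that is marked as hidden.
  with_hidden=$(pdftohtml -xml -i -hidden -stdout "$pdf_path" | grep -c '<text ')

  echo "text elements without -hidden: $visible"
  echo "text elements with -hidden:    $with_hidden"
}

# Usage (path is a placeholder):
#   count_text_elements /PATH/TO/PDF/pdf_name.pdf
```

If the second count is much larger than the first, most of the page text is probably stored as hidden text.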
If you want to make it work, clone this repository:
git clone https://github.com/huridocs/pdf-document-layout-analysis
cd pdf-document-layout-analysis
Move to the following branch:
git checkout xml-hidden-by-default
And then execute the text extraction:
make start
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060
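If you have several pages to process, the request above could be wrapped in a small function, roughly as sketched here. The function name extract_layout and the output file name are made up for this example; the endpoint is the one from the command above, and the format of the response is not assumed.

```shell
#!/bin/sh
# Sketch: post one PDF to the locally running service and print the response.
# extract_layout is a hypothetical helper name.
extract_layout() {
  pdf_path="$1"
  endpoint="${2:-localhost:5060}"   # endpoint taken from the command above

  # Stop early if the file is missing, so curl is never called with a bad path.
  if [ ! -f "$pdf_path" ]; then
    echo "PDF not found: $pdf_path" >&2
    return 1
  fi

  # --fail makes curl return a non-zero status on HTTP errors.
  curl --fail --silent --show-error -X POST -F "file=@${pdf_path}" "$endpoint"
}

# Usage (path and output name are placeholders):
#   extract_layout /PATH/TO/PDF/pdf_name.pdf > pdf_name_output.txt
```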
Attachment: imfname_158653_page4.pdf