PDF specification #2

Open
aethralis opened this issue Mar 7, 2024 · 4 comments

@aethralis

I have some issues with the PDF that rescribe creates. I'm using the latest version (1.2.0) and am having trouble importing the produced text into R with pdftools. The error message is:

PDF error (142084): Unknown operator 'Inf'
PDF error (142084): Too few (0) args to 'Tz' operator
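
A note on reading these two errors together: `Tz` is the PDF operator that sets horizontal text scaling, and it takes one numeric operand. One plausible way to get this pair of messages is a writer formatting a non-finite number into the content stream, so the parser meets a bare word where it expects a number and then finds `Tz` with no operand. That is an assumption, not something confirmed in this thread. The Python sketch below only illustrates the idea; the function names are hypothetical and this is not rescribe's code.

```python
import math


def naive_tz_operator(scale_percent: float) -> str:
    """Format a 'Tz' operator without checking the operand (hypothetical helper)."""
    return f"{scale_percent:.2f} Tz"


def tz_operator(scale_percent: float) -> str:
    """Format a 'Tz' operator, refusing values a PDF parser cannot read as a number."""
    if not math.isfinite(scale_percent):
        raise ValueError(f"non-finite horizontal scaling: {scale_percent!r}")
    return f"{scale_percent:.2f} Tz"


# A finite value yields a valid operand/operator pair.
print(naive_tz_operator(100.0))

# An infinite value is written as a word, which a parser would treat
# as an unknown operator followed by 'Tz' with zero operands.
print(naive_tz_operator(float("inf")))
```

A guard like `tz_operator` turns the silent bad output into an error at write time, which is easier to trace than a malformed PDF.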

When repairing the PDF with Ghostscript (with the options -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress), it gives the following advice:

The following errors were encountered at least once while processing this file:
missing white space after number
error executing PDF token

**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.

After repairing the file with gs, the import into R with pdftools works fine.
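
For anyone who wants to check many PDFs this way, the Ghostscript report quoted above can be picked apart with a few lines of Python. This is a minimal sketch that assumes the output keeps the shape shown in this comment (a "The following ... were encountered" header followed by one message per line); `gs_messages` is a hypothetical helper name, not part of Ghostscript or rescribe.

```python
def gs_messages(output: str) -> list[str]:
    """Collect the messages listed after a 'The following ... were encountered' header."""
    messages = []
    collecting = False
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("The following") and "encountered" in stripped:
            # Start of a warnings or errors list.
            collecting = True
            continue
        if collecting:
            # A blank line or a '****' banner ends the list.
            if not stripped or stripped.startswith("****"):
                collecting = False
                continue
            messages.append(stripped)
    return messages


sample = """The following errors were encountered at least once while processing this file:
missing white space after number
error executing PDF token

**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
"""

print(gs_messages(sample))
```

An empty list means no such header was found in the captured output.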

@nickjwhite
Member

Thanks for finding this and sending such a helpful writeup! I think I've found and fixed the issue, though it only happens with some inputs, and I haven't found one which reproduces your issue. Are you able to test a build for me? If you use Linux, this is a test build I just made: https://rescribe.xyz/tmp/rescribe-fixissue2-v1. If you'd prefer a build for some other OS, let me know and I can make one for you (or you can build one yourself from the fixpdf-issue2 branch).

@aethralis
Author

aethralis commented Mar 12, 2024

Thank you for addressing this, I really appreciate it!

I tested the new build:

  1. If I run rescribe-fixissue2-v1 without flags, I get the following error:
    2024/03/12 16:17:51 No getgbook found [tried getgbook], google book downloading will be disabled, either set -gbookcmd on the command line or use the official build which includes an embedded getgbook.
    Error: Training files rescribev9_fast.traineddata or /tmp/tesseract3617539103/tessdata/rescribev9_fast.traineddata could not be opened.
    Set the `-t` flag with path to a tesseract .traineddata file.
  2. Following that suggestion (after downloading lat.traineddata), I ran:
    ./rescribe-fixissue2-v1 -t lat.traineddata test/

The resulting PDF imports into R fine, but if I repair it again with gs (just to see whether it gives any suggestions), I get the following message:

The following warnings were encountered at least once while processing this file:
	File has Embedded files which could not be preserved

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> �� <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

Thanks again!

@nickjwhite
Member

Thanks for checking and reporting that the original issue was fixed! I have just released v1.3.0, which includes the fix, and includes the embedded training data as other proper releases do (well done figuring out how to get that test build to work without it, by the way!)

Regarding the new issue you found, "File has Embedded files which could not be preserved", I can't reproduce that on my end (yet). When I open a test PDF I created with `gs test.pdf`, it just shows each page without complaining. I'm not very familiar with Ghostscript, so can you give more clues as to how to reproduce this, please? It's possible this only occurs with some created PDFs, so if you are able to attach an example PDF which has the issue, that would be helpful too.

@aethralis
Author

aethralis commented Apr 25, 2024

Thanks again! When viewing the file with `gs test.pdf` it does indeed not give any errors, but when using `gs -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -o out.pdf test.pdf` I still get the warnings. These are not showstoppers, but it may be worth looking into what causes them.
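
To make the reproduction easy to repeat, the rewrite command above can be wrapped in a small Python script. This is a sketch under stated assumptions: `gs` is on the PATH, and the helper names `gs_rewrite_command` and `gs_rewrite` are hypothetical. The flags are exactly the ones quoted in this comment.

```python
import subprocess


def gs_rewrite_command(src: str, dst: str) -> list[str]:
    """Build the Ghostscript pdfwrite command quoted in this thread."""
    return ["gs", "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/prepress", "-o", dst, src]


def gs_rewrite(src: str, dst: str) -> str:
    """Run the rewrite and return everything Ghostscript printed.

    errors="replace" keeps undecodable bytes (such as a garbled
    producer string) from raising while the output is decoded.
    """
    result = subprocess.run(
        gs_rewrite_command(src, dst),
        capture_output=True,
        text=True,
        errors="replace",
        check=False,  # warnings alone should not abort the script
    )
    return result.stdout + result.stderr


print(" ".join(gs_rewrite_command("test.pdf", "out.pdf")))
```

Calling `gs_rewrite("test.pdf", "out.pdf")` and printing the result should show the same warnings as the manual run, which can then be attached to the issue along with the PDF.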
