redo_ocr_PDF some version conflict #1

kareliot · 2022-03-29T09:10:06Z

Hi Jmuccigr,

Your redo_ocr script looks really interesting, and I would love to use it for all the JSTOR PDFs sitting on my hard drive, which I cannot annotate properly because of their poor OCR. Unfortunately, there seems to be a version conflict that I cannot solve on my own. Could you help me out with some advice?

This is the error I get when I run redo_ocr_PDF.sh:

[philipp@philap pdf]$ ./redo_ocr_PDF.sh file1.pdf
No language was specified. Hit enter to use English or supply the 3-letter language code: 
Traceback (most recent call last):
  File "/home/philipp/Software/Skripte/pdf/./remove_PDF_text.py", line 17, in <module>
    with open(outputname, 'wb') as f:
PermissionError: [Errno 13] Permission denied: '/no_text.pdf'
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 573, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 891, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 782, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (pdfminer.six 20220319 (/usr/lib/python3.10/site-packages), Requirement.parse('pdfminer.six!=20200720,<=20211012,>=20191110'), {'ocrmypdf'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==13.4.0', 'console_scripts', 'ocrmypdf')())
  File "/usr/bin/ocrmypdf", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib/python3.10/site-packages/ocrmypdf/__init__.py", line 10, in <module>
    from ocrmypdf import helpers, hocrtransform, pdfa, pdfinfo
  File "/usr/lib/python3.10/site-packages/ocrmypdf/helpers.py", line 22, in <module>
    import img2pdf
  File "/usr/lib/python3.10/site-packages/img2pdf.py", line 49, in <module>
    import pikepdf
  File "/usr/lib/python3.10/site-packages/pikepdf/__init__.py", line 19, in <module>
    from ._version import __version__
  File "/usr/lib/python3.10/site-packages/pikepdf/_version.py", line 7, in <module>
    from pkg_resources import DistributionNotFound
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3266, in <module>
    def _initialize_master_working_set():
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3240, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3278, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 575, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 588, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 777, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'pdfminer.six!=20200720,<=20211012,>=20191110' distribution was not found and is required by ocrmypdf
GPL Ghostscript 9.55.0 (2021-09-27)
Copyright (C) 2021 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
GPL Ghostscript 9.55.0: **** Could not open the file /textonly.pdf .
**** Unable to open the initial device, quitting.
qpdf: open /final.pdf: Permission denied
Error: File not found - /final.pdf
    0 image files updated
    1 files weren't updated due to errors
Error: File not found - /final.pdf
    0 image files updated
    1 files weren't updated due to errors
mv: cannot stat '/final.pdf': No such file or directory
./redo_ocr_PDF.sh: line 144: terminal-notifier: command not found


Jmuccigr · 2022-03-30T18:01:48Z

Well, clearly I should be catching more errors earlier in the script. :-)

I think the problem traces to the inability to open the output file. I suspect that may be Apple's file system protections. Can you try using a file in your Documents folder for output? Looks like you're using something in the root dir right now.
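One way to catch such errors earlier is a pair of pre-flight checks at the top of the script. This is only a sketch, not the script's actual code: the helper names `require_commands` and `require_writable_dir` are hypothetical, and the tool names come from the log output above.

```shell
# Return 0 if every named command is on PATH; otherwise report each
# missing one on stderr and return 1.
require_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "Missing required command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Return 0 if the given path is an existing, writable directory.
require_writable_dir() {
    [ -d "$1" ] && [ -w "$1" ]
}
```

A script could then begin with something like `require_commands ocrmypdf gs qpdf || exit 1`, so that a missing tool or an unwritable output directory stops the run before any processing starts.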

kareliot · 2022-04-02T14:30:09Z

Thanks for your reply!
I had not been aware that you had written it for Mac (although it is actually pretty clear now ;)).
I am using it on Linux and was using a document from my user folder, so I don't think that file protection is the problem.
But maybe I do not have the required versions of pdfminer or ocrmypdf on my machine? The packages are up to date, though, so I am not sure what the problem is. I'll definitely let you know once I have figured it out.

Jmuccigr · 2022-04-02T19:51:20Z

I'm going on this first report of an error:

Traceback (most recent call last):
  File "/home/philipp/Software/Skripte/pdf/./remove_PDF_text.py", line 17, in <module>
    with open(outputname, 'wb') as f:
PermissionError: [Errno 13] Permission denied: '/no_text.pdf'

AFAICT, the script is trying to create a file in the root dir (Permission denied: '/no_text.pdf') and it doesn't have permission. The document you're using as input is not the issue. The no_text.pdf is a temporary output file, and the bash script puts it in $TMPDIR. I see that $TMPDIR may not be valid on some Linux systems, so I've tweaked the script to handle it better (I think). Give it a whirl and let me know.
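The fallback described above can be sketched like this. It is a minimal illustration, not the script's actual code; the variable names `tmpdir` and `no_text` are assumptions.

```shell
# Use $TMPDIR when it is set and non-empty, otherwise fall back to /tmp.
tmpdir="${TMPDIR:-/tmp}"

# Without the fallback, an empty $TMPDIR would turn "$TMPDIR/no_text.pdf"
# into "/no_text.pdf", i.e. a file in the root directory.
no_text="$tmpdir/no_text.pdf"
echo "temporary output: $no_text"
```

If `$TMPDIR` already ends in a slash, the resulting double slash in the path is harmless.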

But maybe I do not have the required versions of pdfminer or ocrmypdf on my machine?

I doubt that's a problem. I'm using basic ocrmypdf functionality and this has worked for me through a few versions. I'm also not sure pdfminer is needed.

If it still doesn't work, try this from the command line and let me know what it says: echo ${TEMPDIR:-/tmp}

PS I've also removed the mac-specific notification at the end.

kareliot · 2022-04-04T09:17:43Z

That was definitely going in the right direction.
I had to change it to

tmpdir=`echo` $(mktemp -d)

to make it work.

I also changed:

input="$tmpdir"input_new.pdf

to

input="$tmpdir"/input_new.pdf

and

input="$tmpdir"input_"$datestring".pdf

to

input="$tmpdir"/input_"$datestring".pdf

so that the file would be created in the temp-folder.

I guess the tmpdir should ideally also be removed by the script once everything has finished, but for now I do it manually.
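A common pattern for this is to pair `mktemp -d` with an exit trap, so the directory is removed automatically. The following is a sketch under that assumption, not part of the script; the `cleanup` function name is hypothetical and the empty file stands in for the real processing.

```shell
# Create a private temporary directory; mktemp -d prints its path.
tmpdir="$(mktemp -d)"

# Remove the directory and its contents when the script exits,
# whether it finishes normally or fails part-way through.
cleanup() {
    rm -rf -- "$tmpdir"
}
trap cleanup EXIT

# Intermediate file named after the one mentioned in this thread.
input="$tmpdir/input_new.pdf"
: > "$input"   # placeholder for the real processing step
echo "working in $tmpdir"
```

Because the trap runs on exit, the final output has to be moved out of `$tmpdir` before the script ends.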

The pdfminer issue was a temporary problem with an old version of ocrmypdf, which I was able to fix.

The script runs through without errors now, but unfortunately it still does not give me a final.pdf (neither in the temp folder nor in the working directory of the script or the original file).
The textonly.pdf is extremely small (4,9 kb) so I am wondering whether there is a problem with ghostscript in the processing of the file. Or perhaps qpdf is having trouble overlaying the file correctly. I'll need to do some more debugging to find the reason. If you have any ideas or suggestions on how to do this, let me know. Thanks for the help!

Jmuccigr · 2022-04-05T15:20:18Z

tmpdir=echo $(mktemp -d)

I'm not sure how mktemp works exactly, but if you use the system's temp dir, you shouldn't have to worry about deleting it at the end. That clean-up will happen automatically. Did the change I made not get you into the system's temp dir? If not, tell me how to get there and what system you're on, and I can change the script to handle it. You'll see that the script now just looks for $TMPDIR or /tmp. (The command I gave above should have been echo ${TMPDIR:-/tmp}, with no "E".)

I also changed:

Also if you just add a "/" to the end of your tmpdir string, you can leave the other commands alone. But if you want to test, maybe set that to a known folder, so you can make sure everything else is working ok.
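The trailing-slash approach can be made independent of how the directory was supplied. This is a sketch, assuming the variable is called `tmpdir` as in the lines quoted above:

```shell
# Start from $TMPDIR or /tmp, strip one trailing slash if present,
# then append exactly one, so the value always ends in "/".
tmpdir="${TMPDIR:-/tmp}"
tmpdir="${tmpdir%/}/"

# The original concatenation then works without an explicit separator.
input="${tmpdir}input_new.pdf"
echo "$input"
```

For testing, `tmpdir` can be set to a known folder instead, which makes it easy to inspect the intermediate files.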

The textonly.pdf is extremely small (4,9 kb) so I am wondering whether there is a problem with ghostscript in the processing of the file.

The text-only file should be fairly small, since all it has in it is the text. 4.9k is very small for sure, though I don't know what your file looks like.

unfortunately it still does not give me a final.pdf

The output file isn't called "final.pdf", but has a date-stamped version of the original file's name, like filename_2022-04-05_11.11.57.pdf.
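A name of that shape can be built as follows. This is a sketch only: `datestring` appears in the lines quoted earlier in this thread, while `original`, `base` and `output` are hypothetical names, and the script's real logic may differ.

```shell
original="file1.pdf"                      # example input name from this thread
datestring="$(date +%Y-%m-%d_%H.%M.%S)"   # same shape as 2022-04-05_11.11.57
base="${original%.pdf}"                   # drop the .pdf extension
output="${base}_${datestring}.pdf"
echo "$output"
```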
