[3rdparty]: paperless-ngx - ocrmypdf fails with AttributeError #1476

winnieXY · 2025-02-06T09:24:25Z

Simple sanity checks

This is an issue with an app that uses OCRmyPDF for OCR
I am using a recent version of the third party app
I will include a file that reproduces the issue

Third party app name and version

paperless-ngx 2.14.7

Describe the bug

I tried to upload a file (BSAV - Beitragsorientierte Siemens Altersversorgung), so sorry, I won't provide that file to you ;-). The upload fails with the error shown below.

When I print the document to a new PDF and upload that printed PDF, everything works as expected.

If you need more information than the stack trace, please ping me; I may be able to provide more debug information.

Steps to reproduce

1. Import attached file into Paperless-ngx
2. Trigger OCR
3. Check log file
4. .. nothing else..

Files

No response

OCRmyPDF version

No response

Relevant log output

[2025-02-06 10:07:46,681] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-bj6qa3iw

[2025-02-06 10:07:46,685] [ERROR] [paperless.consumer] Error occurred while consuming document TRS_BSAV-Kontoauszug_Z003PVYF_2025-1.pdf: AttributeError: 'int' object has no attribute 'get'

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 382, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 214, in run_pipeline

    return _run_pipeline(options, plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline

    pdfinfo = do_get_pdfinfo(origin_pdf, executor, options)

              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 318, in do_get_pdfinfo

    return get_pdfinfo(

           ^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 199, in get_pdfinfo

    return PdfInfo(

           ^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1170, in __init__

    pscript5_mode = str(pdf.docinfo.get(Name.Creator, "")).startswith(

                        ^^^^^^^^^^^^^^^

AttributeError: 'int' object has no attribute 'get'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap

    raise exc_info[1]

  File "/usr/src/paperless/src/documents/consumer.py", line 477, in run

    document_parser.parse(self.working_copy, mime_type, self.filename)

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 449, in parse

    raise ParseError(f"{e.__class__.__name__}: {e!s}") from e

documents.parsers.ParseError: AttributeError: 'int' object has no attribute 'get'

[2025-02-06 10:07:46,743] [ERROR] [paperless.tasks] ConsumeTaskPlugin failed: TRS_BSAV-Kontoauszug_Z003PVYF_2025-1.pdf: Error occurred while consuming document TRS_BSAV-Kontoauszug_Z003PVYF_2025-1.pdf: AttributeError: 'int' object has no attribute 'get'

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 382, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 214, in run_pipeline

    return _run_pipeline(options, plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline

    pdfinfo = do_get_pdfinfo(origin_pdf, executor, options)

              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 318, in do_get_pdfinfo

    return get_pdfinfo(

           ^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 199, in get_pdfinfo

    return PdfInfo(

           ^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1170, in __init__

    pscript5_mode = str(pdf.docinfo.get(Name.Creator, "")).startswith(

                        ^^^^^^^^^^^^^^^

AttributeError: 'int' object has no attribute 'get'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap

    raise exc_info[1]

  File "/usr/src/paperless/src/documents/consumer.py", line 477, in run

    document_parser.parse(self.working_copy, mime_type, self.filename)

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 449, in parse

    raise ParseError(f"{e.__class__.__name__}: {e!s}") from e

documents.parsers.ParseError: AttributeError: 'int' object has no attribute 'get'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/src/paperless/src/documents/tasks.py", line 154, in consume_file

    msg = plugin.run()

          ^^^^^^^^^^^^

  File "/usr/src/paperless/src/documents/consumer.py", line 509, in run

    self._fail(

  File "/usr/src/paperless/src/documents/consumer.py", line 151, in _fail

    raise ConsumerError(f"{self.filename}: {log_message or message}") from exception

documents.consumer.ConsumerError: TRS_BSAV-Kontoauszug_Z003PVYF_2025-1.pdf: Error occurred while consuming document TRS_BSAV-Kontoauszug_Z003PVYF_2025-1.pdf: AttributeError: 'int' object has no attribute 'get'
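For readers skimming the log: the innermost frame calls `.get` on `pdf.docinfo`, and the error says that object was an `int` rather than something dictionary-like. The sketch below reproduces that failure mode with plain Python stand-ins and shows one defensive lookup. It is not pikepdf or OCRmyPDF code; `creator_of`, the `"/Creator"` key string, and the sample values are assumptions made for illustration only.

```python
def creator_of(docinfo, key="/Creator"):
    """Return the creator as a string, or "" when docinfo has no usable .get().

    Hypothetical helper: it only illustrates guarding the lookup, it is not
    how OCRmyPDF or pikepdf handle the case.
    """
    getter = getattr(docinfo, "get", None)
    if not callable(getter):
        # e.g. an int where a dictionary was expected
        return ""
    return str(getter(key, ""))


# Plain stand-ins for a well-formed and a malformed document info object.
good_docinfo = {"/Creator": "Example Writer"}
bad_docinfo = 0

if __name__ == "__main__":
    try:
        bad_docinfo.get("/Creator", "")  # same shape as the failing call
    except AttributeError as exc:
        print(f"unguarded lookup failed: {exc}")

    print(repr(creator_of(good_docinfo)))
    print(repr(creator_of(bad_docinfo)))
```

The guard treats a malformed info object the same as a missing creator entry, which is one possible choice; rejecting the file outright would be another.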


jbarlow83 · 2025-02-07T08:40:16Z

Fixed in pikepdf v9.5.2
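To check whether an environment (for example, the Python environment inside a paperless-ngx container) already has a pikepdf release at or above the version named here, a small script along these lines may help. It is a minimal sketch using only the standard library; `parse_release`, `has_fix`, and `installed_pikepdf_version` are hypothetical helper names, and the version parsing is deliberately simplified (pre-release suffixes such as `rc1` are ignored).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

FIXED_IN = (9, 5, 2)  # taken from the maintainer's comment in this thread


def parse_release(text: str) -> Tuple[int, ...]:
    """Turn '9.5.2' into (9, 5, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # trailing suffix such as 'rc1': stop here
            break
    return tuple(parts)


def has_fix(installed: str, fixed_in: Tuple[int, ...] = FIXED_IN) -> bool:
    """True when the installed release is at or above the fixed release."""
    return parse_release(installed) >= fixed_in


def installed_pikepdf_version() -> Optional[str]:
    """Return the installed pikepdf version string, or None if it is absent."""
    try:
        return version("pikepdf")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pikepdf_version()
    if found is None:
        print("pikepdf is not installed in this environment")
    else:
        print(f"pikepdf {found}: fix present = {has_fix(found)}")
```

Run it with the same interpreter that the consumer uses, since a different interpreter may see a different pikepdf installation.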

winnieXY added the triage Issue needs triage label Feb 6, 2025

winnieXY assigned jbarlow83 Feb 6, 2025

jbarlow83 closed this as completed Feb 7, 2025

github-actions bot removed the triage Issue needs triage label Feb 7, 2025

winnieXY mentioned this issue Feb 14, 2025

[BUG] ocrmypdf fails with AttributeError (fixed upstream) paperless-ngx/paperless-ngx#9111

Closed
