[3rdparty]: paperless-ngx - ocrmypdf fails with AttributeError #1476

Closed
2 of 3 tasks
winnieXY opened this issue Feb 6, 2025 · 1 comment
winnieXY commented Feb 6, 2025

Simple sanity checks

  • This is an issue with an app that uses OCRmyPDF for OCR
  • I am using a recent version of the third party app
  • I will include a file that reproduces the issuse

Third party app name and version

paperless-ngx 2.14.7

Describe the bug

I tried to upload a file (BSAV - Beitragsorientierte Siemens Altersversorgung, the Siemens contribution-oriented pension plan), so sorry, I won't provide that file to you ;-). The upload fails with the error shown below.

When I print the document to a new PDF and upload that printed PDF, everything works as expected.

If you need more information than the stack trace, please ping me; I may be able to get more debug information for you.

Steps to reproduce

1. Import attached file into Paperless-ngx
2. Trigger OCR
3. Check log file
4. .. nothing else..

Files

No response

OCRmyPDF version

No response

Relevant log output

[2025-02-06 10:07:46,681] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-bj6qa3iw
[2025-02-06 10:07:46,685] [ERROR] [paperless.consumer] Error occurred while consuming document TRS_BSAV-Kontoauszug_Z003PVYF_2025-1.pdf: AttributeError: 'int' object has no attribute 'get'
Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 382, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 214, in run_pipeline
    return _run_pipeline(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline
    pdfinfo = do_get_pdfinfo(origin_pdf, executor, options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 318, in do_get_pdfinfo
    return get_pdfinfo(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 199, in get_pdfinfo
    return PdfInfo(
           ^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1170, in __init__
    pscript5_mode = str(pdf.docinfo.get(Name.Creator, "")).startswith(
                        ^^^^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'get'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap
    raise exc_info[1]
  File "/usr/src/paperless/src/documents/consumer.py", line 477, in run
    document_parser.parse(self.working_copy, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 449, in parse
    raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
documents.parsers.ParseError: AttributeError: 'int' object has no attribute 'get'
[2025-02-06 10:07:46,743] [ERROR] [paperless.tasks] ConsumeTaskPlugin failed: TRS_BSAV-Kontoauszug_Z003PVYF_2025-1.pdf: Error occurred while consuming document TRS_BSAV-Kontoauszug_Z003PVYF_2025-1.pdf: AttributeError: 'int' object has no attribute 'get'
Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 382, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 214, in run_pipeline
    return _run_pipeline(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline
    pdfinfo = do_get_pdfinfo(origin_pdf, executor, options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 318, in do_get_pdfinfo
    return get_pdfinfo(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 199, in get_pdfinfo
    return PdfInfo(
           ^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1170, in __init__
    pscript5_mode = str(pdf.docinfo.get(Name.Creator, "")).startswith(
                        ^^^^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'get'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap
    raise exc_info[1]
  File "/usr/src/paperless/src/documents/consumer.py", line 477, in run
    document_parser.parse(self.working_copy, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 449, in parse
    raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
documents.parsers.ParseError: AttributeError: 'int' object has no attribute 'get'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/tasks.py", line 154, in consume_file
    msg = plugin.run()
          ^^^^^^^^^^^^
  File "/usr/src/paperless/src/documents/consumer.py", line 509, in run
    self._fail(
  File "/usr/src/paperless/src/documents/consumer.py", line 151, in _fail
    raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
documents.consumer.ConsumerError: TRS_BSAV-Kontoauszug_Z003PVYF_2025-1.pdf: Error occurred while consuming document TRS_BSAV-Kontoauszug_Z003PVYF_2025-1.pdf: AttributeError: 'int' object has no attribute 'get'
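
For anyone reading the log above: the innermost frame calls `.get(...)` on `pdf.docinfo`, and the error message says that object was an `int` for this file rather than something dictionary-like. The sketch below reproduces the same message with a plain integer and shows one defensive pattern a caller could use. It makes several assumptions: `creator_of` is a hypothetical helper, a plain `dict` stands in for the PDF info dictionary, and `"ExampleWriter 1.0"` is a made-up value. It is not the change that was made in pikepdf or OCRmyPDF, and it does not explain why this particular PDF has an integer there.

```python
def creator_of(docinfo):
    """Return the '/Creator' entry as a string.

    Hypothetical helper: returns '' when docinfo has no callable .get(),
    which is the situation the traceback reports (docinfo was an int).
    """
    getter = getattr(docinfo, "get", None)
    if not callable(getter):
        return ""
    return str(getter("/Creator", ""))


# Reproduce the message from the log with a plain int.
try:
    (0).get("/Creator", "")
except AttributeError as exc:
    message = str(exc)

print(message)

# A dict stands in for a well-formed info dictionary (assumption).
print(repr(creator_of({"/Creator": "ExampleWriter 1.0"})))

# An int stands in for the malformed case seen in this issue.
print(repr(creator_of(0)))
```

A guard like this only avoids the crash; whether an empty creator string is an acceptable fallback depends on what the caller does with it.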
winnieXY added the triage label Feb 6, 2025

jbarlow83 (Collaborator) commented

Fixed in pikepdf v9.5.2
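
To check whether a given environment already has that release, something like the following could be run with the same Python interpreter that the third-party app uses. This is a minimal sketch: `parse_version`, `has_fix`, and `check_installed` are hypothetical helper names, and the version parsing is deliberately naive (it reads only the leading digits of the first three components, so a pre-release such as `9.5.2rc1` would be treated the same as `9.5.2`).

```python
from importlib.metadata import PackageNotFoundError, version

# Release named in the comment above.
FIXED_IN = (9, 5, 2)


def parse_version(text):
    """Turn '9.5.2' into (9, 5, 2), keeping only leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(installed):
    """True if the given version string is at least FIXED_IN."""
    return parse_version(installed) >= FIXED_IN


def check_installed():
    """Return True/False for the installed pikepdf, or None if it is not installed."""
    try:
        installed = version("pikepdf")
    except PackageNotFoundError:
        return None
    return has_fix(installed)


if __name__ == "__main__":
    print("pikepdf has the fix:", check_installed())
```

If this reports `False`, upgrading pikepdf in that environment is the step the comment above points to; how to do that inside a packaged deployment depends on the app and is not covered here.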
