TypeError: Invalid input type 'PdfDocument' #183

dfanr · 2024-06-11T03:39:40Z

The dictionary_output() function from pdftext only accepts a pdf_path parameter, but here a doc object is passed in:

def get_text_blocks(doc, max_pages: Optional[int] = None) -> (List[Page], Dict):
    toc = get_toc(doc)

    page_range = range(len(doc))
    if max_pages:
        range_end = min(max_pages, len(doc))
        page_range = range(range_end)

    char_blocks = dictionary_output(doc, page_range=page_range, keep_chars=True)
    marker_blocks = [pdftext_format_to_blocks(page, pnum) for pnum, page in enumerate(char_blocks)]

    return marker_blocks, toc

dfanr · 2024-06-11T04:00:49Z

Mac currently supports only torch 2.2.2, which makes it impossible to install the latest package.

aniketinamdar · 2024-06-11T16:18:55Z

Did you happen to find the documentation for pdftext? I am not able to find any. If so, please share it.

MapleFly52 · 2024-06-12T03:33:05Z

pdftext-0.3.7

aniketinamdar · 2024-06-12T07:13:00Z

def get_text_blocks(doc, fname, max_pages: Optional[int] = None, start_page: Optional[int] = None) -> (List[Page], Dict):
    toc = get_toc(doc)

    if start_page:
        assert start_page < len(doc)
    else:
        start_page = 0

    if max_pages:
        if max_pages + start_page > len(doc):
            max_pages = len(doc) - start_page
    else:
        max_pages = len(doc) - start_page

    page_range = range(start_page, start_page + max_pages)

    char_blocks = dictionary_output(fname, page_range=page_range, keep_chars=True, workers=settings.PDFTEXT_CPU_WORKERS)
    marker_blocks = [pdftext_format_to_blocks(page, pnum) for pnum, page in enumerate(char_blocks)]

    return marker_blocks, toc

The repository has since changed to take the filename instead of the document object.
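
For readers comparing the two signatures: the error in the issue title comes from a type check that rejects an already-opened document where a path is expected. Below is a minimal sketch of that kind of guard. The helper name ensure_pdf_path and the stand-in PdfDocument class are hypothetical, written only for illustration; they are not part of pdftext, marker, or pypdfium2.

```python
from pathlib import Path
from typing import Union


class PdfDocument:
    """Stand-in for an already-opened document object (hypothetical)."""


def ensure_pdf_path(source: Union[str, Path, object]) -> str:
    """Return a path string, or raise TypeError for any other input type.

    Hypothetical helper: it mimics the shape of the error message in this
    thread, but it is not the library's actual implementation.
    """
    if isinstance(source, (str, Path)):
        return str(source)
    raise TypeError(f"Invalid input type '{type(source).__name__}'")


# Passing a filename works; passing the document object does not.
print(ensure_pdf_path("example.pdf"))
try:
    ensure_pdf_path(PdfDocument())
except TypeError as exc:
    print(f"TypeError: {exc}")
```

A caller that holds both the filename and the opened document would pass the filename to the extraction call, as the updated get_text_blocks above does with fname.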

wangxf2000 · 2024-06-25T07:56:09Z

I am running marker_pdf on a Mac server with torch=2.2.2, pdftext=0.3.10, marker-pdf=0.2.6, and python=3.11.
I also encountered this problem. I adjusted the environment several times but still hit the same error.

marker_single /Users/xuefeng/Downloads/pdf/173000004046314212.pdf /Users/xuefeng/Downloads/pdf/ --batch_multiplier 1
/opt/anaconda3/envs/marker/lib/python3.11/site-packages/threadpoolctl.py:1214: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

  warnings.warn(msg, RuntimeWarning)
Loading detection model vikp/surya_det2 on device cpu with dtype torch.float32
Loading detection model vikp/surya_layout2 on device cpu with dtype torch.float32
Loading reading order model vikp/surya_order on device mps with dtype torch.float16
Loaded texify model to mps with torch.float16 dtype
Traceback (most recent call last):
  File "/opt/anaconda3/envs/marker/bin/marker_single", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/convert_single.py", line 26, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/marker/convert.py", line 65, in convert_single_pdf
    pages, toc = get_text_blocks(
                 ^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/marker/pdf/extract_text.py", line 85, in get_text_blocks
    char_blocks = dictionary_output(doc, page_range=page_range, keep_chars=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/pdftext/extraction.py", line 75, in dictionary_output
    pages = _get_pages(pdf_path, model, page_range, workers=workers)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/pdftext/extraction.py", line 26, in _get_pages
    pdf_doc = pdfium.PdfDocument(pdf_path)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 674, in _open_pdf
    raise TypeError(f"Invalid input type '{type(input_data).__name__}'")
TypeError: Invalid input type 'PdfDocument'

HarmeetSingh07 · 2024-07-01T07:06:37Z

I got the same issue on my Mac. I resolved it by creating a new venv, installing only PyTorch and marker, and then running the marker command with only the input file path and output path.

wangxf2000 · 2024-07-01T07:17:33Z

My Mac has an Intel chip, where torch is only supported up to version 2.2.2. That torch version works with marker only up to 0.2.6, not the latest release: marker-pdf 0.2.14 depends on torch >= 2.2.2 and < 3.0.0, but surya-ocr 0.4.12 depends on torch >= 2.3.0 and < 3.0.0. I will therefore need to deploy in a different environment.
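
The conflict described here can be checked by hand. The sketch below compares torch 2.2.2 against the two ranges quoted in this comment. It assumes plain dotted numeric versions and uses hypothetical helper names (parse_version, in_range); a real resolver such as pip handles many more cases.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Split a dotted numeric version such as '2.2.2' into (2, 2, 2)."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """True when lower <= version < upper, compared numerically."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Highest torch version reported as available on an Intel Mac in this thread.
TORCH_ON_INTEL_MAC = "2.2.2"

# torch ranges quoted in the comment above: (inclusive lower, exclusive upper).
requirements = {
    "marker-pdf 0.2.14": ("2.2.2", "3.0.0"),
    "surya-ocr 0.4.12": ("2.3.0", "3.0.0"),
}

for package, (lower, upper) in requirements.items():
    ok = in_range(TORCH_ON_INTEL_MAC, lower, upper)
    verdict = "satisfies" if ok else "violates"
    print(f"{package}: torch {TORCH_ON_INTEL_MAC} {verdict} >= {lower}, < {upper}")
```

Both ranges must hold at once, so the surya-ocr lower bound is what rules out torch 2.2.2.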

hashemian · 2024-07-21T13:23:03Z

I've read the above but still don't know which version combination fixes it. I have an env with torch==2.2.2, marker_pdf==0.2.6, pdftext==0.3.10, and surya-ocr==0.4.5, but I still get this error!

In case it's relevant, I installed all of the above via pip install marker_pdf==0.2.6

pigPEQ · 2024-07-24T07:39:49Z

Mac (Intel), pdftext-0.3.7, torch==2.2.2, marker_pdf==0.2.6: success! Thank you.

diverged · 2024-07-27T09:05:51Z

Encountering the same issue on an x86 Mac.

traderpedroso · 2024-08-09T11:53:15Z

Encountering the same issue on an x86 Mac.

pip install pdftext==0.3.7
pip install marker_pdf==0.2.6
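
To confirm which versions an environment actually ended up with, the installed distributions can be compared against the combination reported as working in this thread. This is a rough sketch using the standard library's importlib.metadata. The REPORTED_WORKING table and the helper names are assumptions for illustration, and the pins come from comments here rather than from any official compatibility list.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# Combination reported as working on an Intel Mac in this thread (not official).
REPORTED_WORKING = {"pdftext": "0.3.7", "marker-pdf": "0.2.6", "torch": "2.2.2"}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    expected: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in find_mismatches(REPORTED_WORKING).items():
    print(f"{name}: wanted {wanted}, found {found or 'not installed'}")
```

The lookup parameter exists only so the comparison can be exercised without touching the real environment.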

mara004 · 2024-10-08T18:39:43Z

See #235 (comment)

VikParuchuri · 2024-10-18T15:56:40Z

This was a version mismatch. It should be fixed if you're using the version of pdftext specified in the latest marker.

kenZhangCn mentioned this issue Aug 16, 2024

TypeError: Invalid input type 'PdfDocument' #235

Open

mara004 mentioned this issue Oct 8, 2024

Fix document loading bug VikParuchuri/pdftext#10

Merged

VikParuchuri closed this as completed Oct 18, 2024

pkarw added a commit to CatchTheTornado/text-extract-api that referenced this issue Oct 25, 2024

[fix] fix of version mismatch (VikParuchuri/marker#183)

cc3ab3c
