Marker skips chapter during text extraction #166

Open
nekiee13 opened this issue Jun 2, 2024 · 3 comments

@nekiee13

nekiee13 commented Jun 2, 2024

  1. I installed Marker from the dev branch on Windows 11. It consistently skips all of Chapter 5, "V. Instructions, Procedures, and Drawings".
    Document attached:
    10CFR50AppB_LibOff.pdf

  2. When running the script, I also get the warning below, although it does not seem to cause any issues:

D:\PDF\vMarker2a\lib\site-packages\threadpoolctl.py:1214: RuntimeWarning: 
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at the same time. Both libraries are known to be incompatible and this can cause random crashes or deadlocks on Linux when loaded in the same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

  warnings.warn(msg, RuntimeWarning)
@VikParuchuri
Owner

The text for that chapter probably isn't in the PDF. If you set OCR_ALL_PAGES=true, does it behave any differently?
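
One way to try this from Python is to pass the variable through the environment of the process that runs the conversion. This is only a sketch: it assumes the setting is read from an environment variable named OCR_ALL_PAGES, as the comment above suggests, and the helper names and the command shown in the usage note are hypothetical placeholders, not part of Marker.

```python
import os
import subprocess


def env_with_ocr_all_pages(base_env=None):
    """Return a copy of the environment with OCR_ALL_PAGES set to "true".

    Assumption: the tool reads this setting from an environment variable.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OCR_ALL_PAGES"] = "true"
    return env


def run_with_forced_ocr(command):
    """Run `command` (a list of arguments) with OCR_ALL_PAGES=true set.

    `command` is whatever you normally run to convert the PDF;
    it is not specified in this thread.
    """
    return subprocess.run(command, env=env_with_ocr_all_pages(), check=True)
```

For example, `run_with_forced_ocr(["python", "your_script.py"])`, where `your_script.py` stands in for the actual conversion script. Copying the environment rather than mutating `os.environ` keeps the change limited to the child process.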

@nekiee13
Author

nekiee13 commented Jun 4, 2024

I set OCR to true and reran the extraction. No change. The JSON confirms that OCR succeeded. JSON log attached:

https://gist.github.com/nekiee13/43169c47126fd6f6d9f3de2438ead2dd

@myhloli

myhloli commented Jul 11, 2024

Chapter Five starts close to the bottom of its page, which might have caused the model to misidentify it as a footnote or footer and remove it from the final output.
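
A rough way to test this hypothesis is to check whether the heading's bounding box falls in the bottom band of the page. The sketch below is not Marker's actual layout logic: the function name, the 15% band, the top-left coordinate origin, and the example coordinates are all assumptions made for illustration.

```python
def in_bottom_band(bbox, page_height, band_fraction=0.15):
    """True if the box starts within the bottom `band_fraction` of the page.

    bbox is (x0, y0, x1, y1) with the origin at the top-left corner,
    so larger y values are closer to the bottom of the page.
    """
    _, y0, _, _ = bbox
    return y0 >= page_height * (1.0 - band_fraction)


# Hypothetical blocks on a 792-point-tall page; the coordinates are made up.
blocks = [
    {"text": "IV. Procurement Document Control", "bbox": (72, 300, 540, 315)},
    {"text": "V. Instructions, Procedures, and Drawings", "bbox": (72, 700, 540, 715)},
]

# Blocks that a position-based footer heuristic might drop.
suspects = [b["text"] for b in blocks if in_bottom_band(b["bbox"], page_height=792)]
print(suspects)
```

If the real bounding box of the Chapter V heading lands in such a band while the other chapter headings do not, that would be consistent with the footer explanation, though it would not prove it.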
