This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

PDF Miner returns different results every time #306

Open
aleksandar-devedzic opened this issue Apr 16, 2021 · 1 comment

@aleksandar-devedzic

I have noticed an issue with PDFMiner.
It returns different results each time I run it on the same PDF document. This is my code:

import requests
from io import BytesIO
from pdfminer import high_level

def pdf_sublink_extraction(pdf_links, sleep):
    # Note: `sleep` is unused, and only the text of the last link is returned.
    full_pdf_text = ""
    for pdf_link in pdf_links:
        print("pdf link", pdf_link, '\n')
        try:
            response = requests.get(pdf_link)
            print('response', response, '\n')
            with BytesIO(response.content) as data:
                # Count the pages (the value is not used below).
                num_of_pages = len(list(high_level.extract_pages(data)))

                # Extract the text of at most the first 5 pages.
                full_pdf_text = high_level.extract_text(data, password='', page_numbers=None, maxpages=5, codec='utf-8', caching=True, laparams=None)
                full_pdf_text = full_pdf_text.replace('\n\n\n\n', '\n').strip()

        except Exception:
            # Any download or parsing error ends up here.
            full_pdf_text = "PDF File: " + pdf_link + "\n\nUnable to parse PDF file!"

    return full_pdf_text

print(pdf_sublink_extraction(['https://www.buelach.ch/fileadmin/files/documents/Finanzen/2016_2020_finanzplan.pdf'], 0))
print()
print()
print(pdf_sublink_extraction(['https://www.buelach.ch/fileadmin/files/documents/Finanzen/2016_2020_finanzplan.pdf'], 0))

I compared the two outputs with this tool:
https://www.diffchecker.com/diff

The outputs differ: some lines contain different numbers.

Is that a bug, or am I doing something wrong?
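
For anyone who wants to reproduce the comparison locally instead of through a web tool, here is a minimal sketch using the standard-library `difflib` module. The helper name `diff_extractions` and the sample strings are made up for illustration; in practice the two arguments would be the strings returned by two calls to `pdf_sublink_extraction`.

```python
import difflib


def diff_extractions(first, second):
    """Return the unified-diff lines between two extracted texts.

    An empty list means the two extractions are identical.
    """
    return list(
        difflib.unified_diff(
            first.splitlines(),
            second.splitlines(),
            fromfile="run_1",
            tofile="run_2",
            lineterm="",
        )
    )


# Hypothetical sample data standing in for two extraction runs.
run_1 = "Total 2016\n1 234\n5 678"
run_2 = "Total 2016\n5 678\n1 234"

for line in diff_extractions(run_1, run_2):
    print(line)
```

Posting the resulting diff lines in the issue would show exactly which numbers move between runs.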

@kriffe

kriffe commented Sep 3, 2021

If you run a Python version older than 3.7, you might get non-deterministic behavior, because dictionary ordering is only guaranteed from 3.7 onward. See https://stackoverflow.com/questions/14956313/why-is-dictionary-ordering-non-deterministic

Try upgrading to 3.7 and see if it runs more consistently.
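
A quick way to check which situation applies is to print the interpreter version and the hash seed setting before running the extraction. This is only a sketch: the helper name `dict_order_is_guaranteed` is made up here, and nothing in this thread confirms that dictionary ordering is the actual cause of the differing output.

```python
import os
import sys


def dict_order_is_guaranteed(version_info=sys.version_info):
    """Return True if dicts are guaranteed to keep insertion order (Python 3.7+)."""
    return tuple(version_info[:2]) >= (3, 7)


print("Python version:", sys.version.split()[0])
print("Dict order guaranteed:", dict_order_is_guaranteed())

# String hashing is randomized per process by default (since Python 3.3)
# unless PYTHONHASHSEED is set to a fixed value.
print("PYTHONHASHSEED:", os.environ.get("PYTHONHASHSEED", "<not set>"))
```

If the version is below 3.7, running the script twice with a fixed seed, for example `PYTHONHASHSEED=0`, should show whether hash randomization is involved.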
