
Tools for processing pdf files

This is a lightweight library for processing PDF files in Python. One use case is extracting PDF annotations for ML / NLP.

Support for

obtaining the textual and visual content of PDF files
locating the positions of words
fetching PDF annotations
adding a digital layer to image-based PDFs
re-creating a clean PDF file with annotations removed

Dependencies

The main tool for reading PDF files is the PyPDF2 library. Non-Python dependencies include Poppler.

To install Poppler, see the guide in the pdf2image readme.
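
Once Poppler is installed, a quick check that its command-line tools are visible can save some debugging. The snippet below is a minimal sketch and not part of this library: it assumes the default pdf2image setup, where the Poppler binaries `pdftoppm`, `pdftocairo` and `pdfinfo` are looked up on `PATH`. The helper names `find_poppler_tools` and `poppler_available` are hypothetical.

```python
import shutil

# Poppler command-line tools that pdf2image typically invokes.
# Assumption: default setup, binaries resolved via PATH (no explicit poppler_path).
POPPLER_TOOLS = ("pdftoppm", "pdftocairo", "pdfinfo")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def poppler_available(tools=POPPLER_TOOLS):
    """Return True only if every listed tool was found on PATH."""
    return all(path is not None for path in find_poppler_tools(tools).values())


if __name__ == "__main__":
    # Print one line per tool so a missing binary is easy to spot.
    for tool, path in find_poppler_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

If a tool is reported as missing, either add the Poppler `bin` directory to `PATH` or follow the platform-specific steps in the pdf2image readme.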

How to

Some examples of usage are shown in the notebook.

Todo

Add detection of page orientation (upside-down, rotated, ...) based on images.
Add some of our experiments with "naive" table detection.
Replace PyPDF2, as it is not maintained, with PyMuPDF or pdfminer.six.