This is a lightweight library for processing PDF files in Python. One use case is extracting PDF annotations for ML / NLP.
Support for
- obtaining the textual and visual content of PDF files
- locating the positions of words
- fetching PDF annotations
- adding a digital layer to image PDFs
- re-creating a clean PDF file with annotations removed
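To illustrate the annotation-fetching item above, here is a minimal sketch that calls PyPDF2 directly. It does not show this library's own API. The names `annotation_record` and `fetch_annotations` are hypothetical, and the code assumes PyPDF2 2.0 or later (where `PdfReader` and `reader.pages` are available).

```python
from typing import Any, Dict, List, Mapping


def _value(annot: Mapping[str, Any], key: str, default: Any) -> Any:
    # Use item access rather than .get(): PyPDF2 dictionaries resolve
    # indirect references on item access.
    return annot[key] if key in annot else default


def annotation_record(annot: Mapping[str, Any], page_index: int) -> Dict[str, Any]:
    """Flatten one annotation dictionary into plain Python types (hypothetical helper)."""
    return {
        "page": page_index,
        "subtype": str(_value(annot, "/Subtype", "")),    # e.g. "/Highlight"
        "contents": str(_value(annot, "/Contents", "")),  # the comment text, if any
        "rect": [float(v) for v in _value(annot, "/Rect", [])],  # bounding box
    }


def fetch_annotations(pdf_path: str) -> List[Dict[str, Any]]:
    """Collect the annotations of every page in a PDF file (hypothetical helper)."""
    # Imported here so annotation_record stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    records: List[Dict[str, Any]] = []
    for page_index, page in enumerate(reader.pages):
        if "/Annots" not in page:
            continue  # this page has no annotations
        for ref in page["/Annots"]:
            # Entries are usually indirect references; resolve them first.
            records.append(annotation_record(ref.get_object(), page_index))
    return records
```

Calling `fetch_annotations("document.pdf")` (with a placeholder path) would return one dictionary per annotation, which can then be filtered by `subtype` before being used as ML / NLP labels.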
The main tool for reading PDF files is the PyPDF2 library. Non-Python dependencies include Poppler.
To install Poppler, see the guide in the pdf2image README.
Usage examples are shown in the notebook.
- Add detection of page orientation (upside-down, rotated, ...) based on images.
- Add some of our experiments with "naive" table detection.
- Get rid of PyPDF2, as it is not maintained; replace it with PyMuPDF or pdfminer.six.