Tools for processing pdf files

This is a lightweight Python library for processing PDF files. One use case is extracting PDF annotations for ML / NLP.

Support for

  • obtaining textual and visual content of pdf files
  • locating positions of words
  • fetching pdf annotations
  • adding a digital layer to image-pdfs
  • re-creating a clean pdf file with annotations removed
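To make the annotation use case concrete, here is a minimal sketch that reads annotations directly with PyPDF2 (the dependency named below). It is not this library's own API, which is shown in the notebook. The names `annotation_record` and `iter_annotations` are hypothetical, and the sketch assumes PyPDF2 2.0 or newer, where `PdfReader` and `get_object()` are available.

```python
from typing import Any, Dict, Iterator, Mapping


def annotation_record(page_index: int, annot: Mapping[str, Any]) -> Dict[str, Any]:
    """Flatten one annotation dictionary into a plain record (hypothetical helper)."""
    rect = annot["/Rect"] if "/Rect" in annot else None
    return {
        "page": page_index,  # zero-based page index
        "subtype": str(annot["/Subtype"]) if "/Subtype" in annot else "",
        "contents": annot["/Contents"] if "/Contents" in annot else None,
        # /Rect holds four numbers: lower-left x, y and upper-right x, y
        "rect": [float(v) for v in rect] if rect is not None else None,
    }


def iter_annotations(pdf_path: str) -> Iterator[Dict[str, Any]]:
    """Yield one record per annotation in the file, page by page."""
    # Imported lazily so the helper above works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    for page_index, page in enumerate(reader.pages):
        if "/Annots" not in page:
            continue  # this page has no annotations
        for ref in page["/Annots"]:
            # Entries are usually indirect references; resolve them first.
            yield annotation_record(page_index, ref.get_object())
```

Calling `list(iter_annotations("some_file.pdf"))` (the path is a placeholder) would give a list of plain dictionaries that can be filtered by `subtype`, for example to keep only `/Highlight` entries.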

Dependencies

The main tool for reading PDF files is the PyPDF2 library. Non-Python dependencies include Poppler; to install it, see the guide in the pdf2image README.
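Because Poppler is installed outside of Python, a quick way to confirm it is reachable is to look for its command-line programs on the `PATH`. The sketch below is only an illustration: `missing_poppler_tools` is a hypothetical helper, and it checks for `pdftoppm` and `pdfinfo`, two of the Poppler programs that pdf2image invokes.

```python
import shutil
from typing import List, Sequence

# Two of the Poppler command-line programs that pdf2image relies on.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(tools: Sequence[str] = POPPLER_TOOLS) -> List[str]:
    """Return the names of the given programs that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("Poppler looks installed.")
```

An empty result only means the programs were found, not that they are a compatible version.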

How to

Some examples of usage are shown in the notebook.

Todo

  • Add detection of page orientation (upside-down, rotated, ...) based on images.
  • Add some of our experiments with "naive" table detection
  • Get rid of PyPDF2, as it is not maintained; replace it with PyMuPDF or pdfminer.six.