Extracting text, images and annotations from pdf files.

franp9am/pdf-utils

 
 

Tools for processing pdf files

This is a lightweight library for processing PDF files in Python. One use case is extracting PDF annotations for ML / NLP.

It supports:

  • obtaining the textual and visual content of PDF files
  • locating the positions of words
  • fetching PDF annotations
  • adding a digital layer to image-only PDFs
  • re-creating a clean PDF file with annotations removed
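
To give a feel for what fetching PDF annotations involves, here is a minimal sketch written directly against PyPDF2, not against this library's own interface (see the notebook for that). It assumes PyPDF2 2.0 or later (`PdfReader`, `get_object`); the helper names `summarize_annotation` and `read_annotations` are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List


def _resolve(obj: Any) -> Any:
    """Follow an indirect PDF reference if there is one; plain values pass through."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def summarize_annotation(annot: Any) -> Dict[str, Any]:
    """Reduce one annotation dictionary to its type, text and bounding box."""
    annot = _resolve(annot)
    subtype = _resolve(annot["/Subtype"]) if "/Subtype" in annot else None
    contents = _resolve(annot["/Contents"]) if "/Contents" in annot else None
    rect = _resolve(annot["/Rect"]) if "/Rect" in annot else []
    return {
        "subtype": str(subtype) if subtype is not None else None,
        "contents": str(contents) if contents is not None else None,
        "rect": [float(_resolve(v)) for v in rect],
    }


def read_annotations(pdf_path: str) -> List[Dict[str, Any]]:
    """Collect annotation summaries, each tagged with a 0-based page index."""
    # Imported lazily so the helpers above work without PyPDF2 installed.
    # Assumption: PyPDF2 >= 2.0, which provides PdfReader and reader.pages.
    from PyPDF2 import PdfReader

    results = []
    reader = PdfReader(pdf_path)
    for page_index, page in enumerate(reader.pages):
        if "/Annots" not in page:
            continue  # this page carries no annotations
        for annot in _resolve(page["/Annots"]):
            summary = summarize_annotation(annot)
            summary["page"] = page_index
            results.append(summary)
    return results
```

Calling `read_annotations("some.pdf")` on a file of your own returns one dictionary per annotation. The `rect` values are in PDF user-space units, which is what makes it possible to relate an annotation to the positions of the words on the page.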

Dependencies

The main tool for reading PDF files is the PyPDF2 library. Non-Python dependencies include Poppler.

To install Poppler, see the guide in the pdf2image README.
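
A quick way to confirm that Poppler is reachable before running anything that renders pages is to look for one of its command-line tools on the `PATH`. The sketch below uses only the standard library; `pdftoppm` is one of the Poppler utilities that pdf2image calls, and the function name `find_poppler_tool` is a hypothetical helper, not part of this library.

```python
import shutil
from typing import Optional


def find_poppler_tool(tool: str = "pdftoppm") -> Optional[str]:
    """Return the full path of a Poppler command-line tool, or None if it is not on PATH."""
    return shutil.which(tool)


if __name__ == "__main__":
    path = find_poppler_tool()
    if path is None:
        print("Poppler not found on PATH; see the pdf2image README for installation steps.")
    else:
        print(f"Poppler found: {path}")
```

If Poppler is installed in a directory that is not on the `PATH` (a common situation on Windows), this check reports it as missing even though it may still be usable by passing its location explicitly to pdf2image.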

How to

Usage examples are shown in the notebook.

Todo

  • Add detection of page orientation (upside-down, rotated, ...) based on images.
  • Add some of our experiments with "naive" table detection.
  • Get rid of PyPDF2, as it is not maintained; replace it with PyMuPDF or pdfminer.six.
