GitHub - KENDAXA-Development/pdf-utils: Extracting text, images and annotations from pdf files.

Tools for processing pdf files

This is a lightweight library for processing PDF files in Python. One use case is the extraction of PDF annotations for ML / NLP.

Support for

The main tool for reading PDF files is the PyPDF2 library. Non-Python dependencies are

To install Poppler, see the guide in the pdf2image readme.

Some examples of usage are shown in the notebook.
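
As a rough illustration of the annotation-extraction use case mentioned above, the sketch below reads annotations with PyPDF2 directly. It does not show the API of pdf_utils itself: the function names `collect_annotations` and `read_annotations` are hypothetical, and the code assumes PyPDF2 2.0 or newer (which provides `PdfReader` and `reader.pages`).

```python
from typing import Any, Dict, Iterable, List


def _resolve(obj: Any) -> Any:
    """Follow an indirect PDF reference if the object supports it."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def collect_annotations(pages: Iterable[Any]) -> List[Dict[str, Any]]:
    """Collect page index, subtype, text and rectangle of each annotation.

    Hypothetical helper: `pages` is any iterable of dict-like page objects,
    such as `PdfReader(path).pages` from PyPDF2.
    """
    found = []
    for page_number, page in enumerate(pages):
        page = _resolve(page)
        # "/Annots" is optional; pages without annotations are skipped.
        annots = _resolve(page.get("/Annots")) or []
        for ref in annots:
            annot = _resolve(ref)
            subtype = annot.get("/Subtype")
            rect = annot.get("/Rect")
            found.append({
                "page": page_number,
                "subtype": str(subtype) if subtype is not None else None,
                "contents": annot.get("/Contents"),
                "rect": [float(v) for v in rect] if rect is not None else None,
            })
    return found


def read_annotations(path: str) -> List[Dict[str, Any]]:
    """Open a PDF with PyPDF2 and collect its annotations."""
    # Imported lazily so the helper above works without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0

    return collect_annotations(PdfReader(path).pages)
```

Calling `read_annotations("document.pdf")` (the file name is a placeholder) would return one dictionary per annotation, which could then be turned into labelled examples for an ML / NLP pipeline. Keeping the traversal in `collect_annotations` separate from file loading makes it possible to test the logic on plain dictionaries.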

.github
notebook
pdf_utils
tests
.gitignore
CHANGELOG.txt
CONTRIBUTING.md
LICENSE.txt
MANIFEST.in
README.md
requirements.test.txt
requirements.txt
setup.py
tox.ini