This repository has been archived by the owner on May 11, 2021. It is now read-only.

tesseract-tables

A tool for extracting tables, figures, maps, and pictures from PDFs using Tesseract

Installation

On macOS, you can install the system dependencies with Homebrew:

brew install ghostscript parallel tesseract

Next, install the Python dependencies:

pip install -r requirements.txt

Example usage

Assuming you have a document named my_doc.pdf, you can prepare it for processing and extract tables as follows:

./preprocess.sh ./my_doc_processed ./my_doc.pdf
python do_extract.py ./my_doc_processed

These commands write the extracted tables and figures to ./my_doc_processed/tables. The first command parses the PDF into the required directory structure and creates the data products that extraction depends on (page images and Tesseract output). The second command performs the extraction.

preprocess.sh

Script that prepares a PDF for table extraction. It converts each page of the PDF to a PNG with Ghostscript, then runs the PNGs through Tesseract. It also runs each page through annotate.py to produce annotated images that assist in debugging. Assumes a local installation of tesseract-ocr.
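
The page conversion and OCR steps described above can be sketched roughly as follows. This is an illustration only, not the contents of preprocess.sh: the function names, the 300 dpi resolution, and the page-%d.png file naming are assumptions. The Ghostscript png16m device and the Tesseract hocr config are standard options of those tools, but the script itself may use different ones.

```python
from pathlib import Path


def ghostscript_command(pdf_path, png_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a PNG.

    The 300 dpi default and the page-%d.png naming are assumptions,
    not values taken from preprocess.sh.
    """
    output_pattern = str(Path(png_dir) / "page-%d.png")
    return [
        "gs",
        "-dBATCH",            # exit after processing the file
        "-dNOPAUSE",          # do not prompt between pages
        "-sDEVICE=png16m",    # 24-bit colour PNG output
        f"-r{dpi}",           # rendering resolution
        f"-sOutputFile={output_pattern}",
        str(pdf_path),
    ]


def tesseract_command(png_path, out_dir):
    """Build a Tesseract command that writes hOCR output for one page."""
    output_base = str(Path(out_dir) / Path(png_path).stem)
    # The trailing "hocr" config asks Tesseract for HTML-based hOCR output,
    # which includes bounding boxes for the recognised text.
    return ["tesseract", str(png_path), output_base, "hocr"]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even when Ghostscript and Tesseract are not installed.
    print(" ".join(ghostscript_command("my_document.pdf",
                                       "my_document_processed/png")))
    print(" ".join(tesseract_command("my_document_processed/png/page-1.png",
                                     "my_document_processed/tesseract")))
```

Each returned list could be passed to subprocess.run(cmd, check=True) to execute the step. Building the commands as lists avoids shell quoting problems with paths that contain spaces.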

Example usage

./preprocess.sh ./my_document_processed my_document.pdf

This creates the file structure necessary for extraction:

document_name
  annotated (pngs of what tesseract sees)
  png (each page of the PDF as a PNG image)
  tables (extractions)
  tesseract (HTML for each page produced by tesseract)
  orig.pdf (The original document)
  text.txt (The extracted text layer)
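
Before running do_extract.py, it can be useful to confirm that preprocessing produced this layout. The check below is a minimal sketch and not part of the repository: the missing_entries helper and the two constants are hypothetical names, and the entries are taken directly from the listing above.

```python
from pathlib import Path

# Entries copied from the directory listing above.
EXPECTED_DIRS = ("annotated", "png", "tables", "tesseract")
EXPECTED_FILES = ("orig.pdf", "text.txt")


def missing_entries(document_dir):
    """Return the expected directories and files absent from document_dir."""
    root = Path(document_dir)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


if __name__ == "__main__":
    problems = missing_entries("./my_document_processed")
    if problems:
        print("Missing from processed directory:", ", ".join(problems))
    else:
        print("Directory layout looks complete.")
```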

Funding

Development supported by NSF ICER 1343760

License

MIT