Source code and data for the paper "Data Discovery for the SDGs: A Systematic Rule-based Approach".
You will need to install Poppler and Tesseract before you can run the Python code.
Python requirements can be found in requirements.txt.
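As a quick sanity check before opening the notebooks, the sketch below tests whether the two system dependencies are visible on PATH. It assumes the usual executable names (`pdftoppm`, which ships with Poppler, and `tesseract`); the helper `missing_tools` is a hypothetical name and is not part of this repository.

```python
import shutil

# Executables assumed to be on PATH: `pdftoppm` ships with Poppler,
# `tesseract` is the Tesseract OCR command-line tool.
REQUIRED_TOOLS = {"Poppler": "pdftoppm", "Tesseract": "tesseract"}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("Poppler and Tesseract were found on PATH.")
```

If either tool is reported missing, install it with your platform's package manager before running the notebooks.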
There are two Jupyter notebooks:
notebooks/01-Entity_Extraction.ipynb: Loads PDFs from data/sdg7-papers (note: PDFs not included in this repository), converts each to images using Poppler, uses Tesseract to perform OCR and extract the text, then extracts entities based on rules derived from data/sdg7-coding-manual.xlsx to output sdg7-coding-auto.xlsx.
notebooks/02-Plot_Figures.ipynb: Loads data/sdg7-coding-auto.xlsx and applies some processing before plotting a sunburst chart of data mapping distributions.
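The flow of the first notebook (PDF to images, images to text, text to entities) can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the `pdf2image` and `pytesseract` wrappers around Poppler and Tesseract, which may not be what the notebook uses, and the `RULES` table and function names are hypothetical placeholders rather than the rules derived from the coding manual.

```python
import re


def ocr_pdf(pdf_path):
    """Convert a PDF to page images with Poppler, then OCR each page."""
    # Imported lazily so the rule matching below works without them.
    # Assumption: pdf2image (Poppler wrapper) and pytesseract (Tesseract wrapper).
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


# Hypothetical rules for illustration only; the real rules are derived
# from the coding manual spreadsheet and are not reproduced here.
RULES = {
    "electricity access": re.compile(r"\baccess to electricity\b", re.IGNORECASE),
    "clean cooking": re.compile(r"\bclean cooking\b", re.IGNORECASE),
}


def extract_entities(text, rules=RULES):
    """Return {label: match count} for every rule that matches the text."""
    # Collapse whitespace so phrases split across OCR line breaks still match.
    normalized = " ".join(text.split())
    counts = {}
    for label, pattern in rules.items():
        hits = pattern.findall(normalized)
        if hits:
            counts[label] = len(hits)
    return counts
```

For a single paper, `extract_entities(ocr_pdf(path))` would give the labels found in that document; collecting these per paper into a table is one plausible way to arrive at a spreadsheet like the one the second notebook loads.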