This workshop teaches how to build a dataset by scraping a website with Python.
That entails parsing HTML, downloading PDFs, and extracting data from those PDFs.
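The notebook covers these steps in full; as a rough sketch of the HTML-parsing stage, here is one way to collect PDF links using only the standard library. The markup below is invented for illustration, and the notebook itself may use a different parser:

```python
from html.parser import HTMLParser

class PDFLinkParser(HTMLParser):
    """Collect href values that point at PDF files."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.pdf_links.append(value)

# Hypothetical markup standing in for the source website.
sample_html = """
<ul>
  <li><a href="/reports/2021.pdf">2021 report</a></li>
  <li><a href="/reports/2022.pdf">2022 report</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
"""

parser = PDFLinkParser()
parser.feed(sample_html)
print(parser.pdf_links)  # → ['/reports/2021.pdf', '/reports/2022.pdf']
```

Each collected link can then be downloaded (e.g. with urllib or requests) before the extraction step.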
Install JupyterLab and the other dependencies if necessary (a virtual environment is recommended). This was set up with Python 3.10:

pip install -r requirements.txt
You can then start the server with:

jupyter-lab

Open the notebook in JupyterLab; it explains everything else.
It's possible the source website will change or disappear entirely, so it is archived in the bak/web directory.
All the PDFs that should be downloaded are in bak/raw, and a sample "final product" CSV is included in the bak/data directory.
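For the final step, a CSV like the sample can be written with the standard-library csv module. This is only a sketch: the rows and column names below are invented, and the real schema is whatever the notebook extracts from the PDFs (see the sample CSV in bak/data):

```python
import csv
import io

# Hypothetical rows standing in for data extracted from the PDFs.
rows = [
    {"file": "2021.pdf", "title": "Annual report 2021", "pages": 42},
    {"file": "2022.pdf", "title": "Annual report 2022", "pages": 38},
]

# Write to an in-memory buffer here; the notebook would write to a file
# such as data/output.csv instead.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["file", "title", "pages"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```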