Scraping Workshop

This is a workshop that teaches how to use Python to create a dataset by scraping a website.

This entails parsing HTML, downloading PDFs, and extracting data from PDFs.
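As a rough illustration of the HTML-parsing step, the sketch below collects links to PDFs from a page using only the Python standard library. It is not taken from the workshop notebook: the names `PDFLinkParser` and `find_pdf_links`, the sample HTML, and the `example.org` URL are all hypothetical, and the notebook may well use a different library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PDFLinkParser(HTMLParser):
    """Collect absolute URLs of every <a href="..."> that points to a PDF."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links whose target ends in ".pdf" (case-insensitive).
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.pdf_urls.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return the PDF URLs found in an HTML string (hypothetical helper)."""
    parser = PDFLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls


# Made-up sample page, for illustration only.
sample_html = (
    '<ul>'
    '<li><a href="reports/2021.pdf">2021</a></li>'
    '<li><a href="/about.html">About</a></li>'
    '</ul>'
)
print(find_pdf_links(sample_html, "https://example.org/data/"))
```

Resolving each link with `urljoin` means the downloading step can work with absolute URLs regardless of how the page wrote them.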

Installation

Install JupyterLab if necessary (you can use a virtual environment). I set this up with Python 3.10.

pip install -r requirements.txt

You can then start the JupyterLab server by running jupyter-lab.

Running the workshop

Just open the notebook in JupyterLab; it explains everything.

Backups

It's possible the source website will change or disappear entirely, so a copy of it is archived in the bak/web directory. All the PDFs that should be downloaded are in bak/raw. A sample "final product" CSV is also included in the bak/data directory.
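One way to make use of the backups is to fall back to bak/raw whenever a download fails. The sketch below is only an assumption about how that could look: `fetch_pdf` is a hypothetical helper, and it assumes each backup file has the same name as the last path segment of its URL, which this README does not confirm.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlopen

BACKUP_DIR = Path("bak/raw")  # backup directory named in this README


def fetch_pdf(url, backup_dir=BACKUP_DIR, timeout=10):
    """Return the PDF bytes at url, or the local backup copy if the download fails.

    Assumption: the backup file is named after the last path segment of the URL.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError):
        # URLError and HTTPError are subclasses of OSError; ValueError covers
        # malformed URLs. In either case, try the local backup instead.
        filename = PurePosixPath(urlparse(url).path).name
        return (Path(backup_dir) / filename).read_bytes()
```

If the backup file is missing too, `read_bytes` raises `FileNotFoundError`, so a failure stays visible instead of silently producing an empty dataset.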