This workshop teaches how to build a dataset by scraping a website with Python.
That entails parsing HTML, downloading PDFs, and extracting data from those PDFs.
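The notebook covers these steps in full; as a rough sketch of the HTML-parsing stage, here is one way to collect PDF links using only the standard library. The markup below is invented for illustration, and the notebook itself may use a different parser:

```python
from html.parser import HTMLParser

class PDFLinkParser(HTMLParser):
    """Collect href values that point at PDF files."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.pdf_links.append(value)

# Hypothetical markup standing in for the source website.
sample_html = """
<ul>
  <li><a href="/reports/2021.pdf">2021 report</a></li>
  <li><a href="/reports/2022.pdf">2022 report</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
"""

parser = PDFLinkParser()
parser.feed(sample_html)
print(parser.pdf_links)  # → ['/reports/2021.pdf', '/reports/2022.pdf']
```

Each collected link can then be downloaded (e.g. with urllib or requests) before the extraction step.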
Install JupyterLab and the other dependencies if necessary (a virtual environment is recommended). This was set up with Python 3.10:

pip install -r requirements.txt
You can then start the server with:

jupyter-lab

Open the notebook in JupyterLab; it explains everything else.
It's possible the source website will change or disappear entirely, so it is archived in the bak/web directory.
All the PDFs that should be downloaded are in bak/raw, and a sample "final product" CSV is included in the bak/data directory.
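For the final step, a CSV like the sample can be written with the standard-library csv module. This is only a sketch: the rows and column names below are invented, and the real schema is whatever the notebook extracts from the PDFs (see the sample CSV in bak/data):

```python
import csv
import io

# Hypothetical rows standing in for data extracted from the PDFs.
rows = [
    {"file": "2021.pdf", "title": "Annual report 2021", "pages": 42},
    {"file": "2022.pdf", "title": "Annual report 2022", "pages": 38},
]

# Write to an in-memory buffer here; the notebook would write to a file
# such as data/output.csv instead.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["file", "title", "pages"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```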