Code to clean and perform exploratory data analysis on traffic crash datasets provided by the City of Chicago:
- Orchestration and raw data fetching / loading: Prefect
- Raw data transformations: dbt
- Database: PostgreSQL
- Install PostgreSQL
- Optionally create a dedicated Postgres user and database for this project. (I created a db called `traffic_crashes`, then created a user `crash` with a separate password and granted that user privileges on the database and all tables in it.)
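  A minimal sketch of that setup using the standard Postgres client tools (the `traffic_crashes` and `crash` names come from above; everything else is illustrative):

  ```sh
  $ createdb traffic_crashes
  $ createuser --pwprompt crash
  $ psql -d traffic_crashes -c "GRANT ALL PRIVILEGES ON DATABASE traffic_crashes TO crash;"
  # covers tables that exist now; re-run after the pipeline creates its tables
  $ psql -d traffic_crashes -c "GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO crash;"
  ```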
- Clone this repo.
- Create a Python virtual environment and install requirements:
```sh
$ cd chicago_traffic_crashes
$ python -m venv venv
$ source venv/bin/activate
$ python -m pip install -r requirements.txt
```
- Rename `.env-example` to `.env` and update values.
- Rename `dbt_models/profiles-example.yml` to `profiles.yml` and update values.
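  For reference, a Postgres `profiles.yml` generally follows dbt's standard shape, sketched below. The profile name, schema, and credentials here are illustrative: the top-level name must match the `profile:` entry in the project's `dbt_project.yml`, and the password can be pulled from the environment you configured in `.env`.

  ```yaml
  chicago_traffic_crashes:   # hypothetical profile name
    target: dev
    outputs:
      dev:
        type: postgres
        host: localhost
        port: 5432
        user: crash
        password: "{{ env_var('DB_PASSWORD') }}"   # hypothetical env var name
        dbname: traffic_crashes
        schema: public
        threads: 4
  ```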
To run the code manually:
```sh
$ cd chicago_traffic_crashes/orchestration
$ python3 main.py
```
This will immediately kick off a run of the full pipeline:
- Fetch the crashes, people, and vehicles CSVs from the Chicago Open Data Portal website.
- Load the data from the CSVs into local `[resource]_raw` PostgreSQL tables.
- Run dbt models to transform the raw data and load it into `[resource]_stg` tables.
- Run dbt models to create analytics views using the `[resource]_stg` tables.
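In outline, the orchestration looks something like the sketch below: a minimal Prefect flow assuming Prefect's `@flow`/`@task` decorators and dbt invoked through its CLI. The task names, file paths, and dbt arguments are illustrative, not the repo's actual code.

```python
import subprocess

from prefect import flow, task

@task
def fetch_csv(resource: str) -> str:
    # Download the resource CSV from the Chicago Open Data Portal.
    # Stubbed here; the real task hits the portal's export URLs.
    return f"/tmp/{resource}.csv"

@task
def load_raw(resource: str, csv_path: str) -> None:
    # COPY the CSV into the local [resource]_raw PostgreSQL table (stubbed).
    print(f"would load {csv_path} into {resource}_raw")

@task
def run_dbt_models() -> None:
    # Build the _stg tables and analytics views via the dbt CLI.
    subprocess.run(["dbt", "run", "--project-dir", "dbt_models"], check=True)

@flow
def crash_pipeline() -> None:
    for resource in ("crashes", "people", "vehicles"):
        csv_path = fetch_csv(resource)
        load_raw(resource, csv_path)
    run_dbt_models()

if __name__ == "__main__":
    crash_pipeline()
```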
To run the pipeline on a schedule, start the Prefect server with `prefect server start`. The default schedule is once a day at midnight Central Time, but it can be adjusted in `deployment.py`.
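For illustration, a deployment along these lines could be defined with Prefect's `serve()` API. This is a sketch, not the repo's actual `deployment.py`, and `crash_pipeline` is the hypothetical flow from the sketch above.

```python
# Hypothetical deployment.py: serve the flow on a daily cron schedule.
from main import crash_pipeline  # the flow defined in orchestration/main.py

if __name__ == "__main__":
    crash_pipeline.serve(
        name="daily-crash-pipeline",
        cron="0 0 * * *",  # daily at midnight (server timezone; adjust for Central Time)
    )
```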