Pipeline that scrapes data from the r/india subreddit and prepares it for the visual layer.
- Infra Provisioning: Terraform (with AWS)
- Containerization: Docker
- Orchestration: Airflow
- Visual Layer: Metabase
- Scrape data from r/india to generate bronze data
- Validate using Pydantic and load data to S3
- Generate and validate silver data and load to S3
- Load silver data into Redshift
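
The validation step above can be sketched roughly as follows. This is a minimal illustration and not the project's actual schema: the `BronzePost` model, its field names, and the `validate_posts` helper are all assumptions made for the example, and the S3 upload is left out.

```python
from typing import List, Tuple

from pydantic import BaseModel, Field, ValidationError


class BronzePost(BaseModel):
    """Hypothetical schema for one scraped r/india post (field names are assumed)."""

    id: str = Field(min_length=1)
    title: str = Field(min_length=1)
    score: int
    num_comments: int = Field(ge=0)
    created_utc: float


def validate_posts(raw_posts: List[dict]) -> Tuple[List[BronzePost], List[dict]]:
    """Split raw records into validated models and rejected raw dicts."""
    valid: List[BronzePost] = []
    rejected: List[dict] = []
    for raw in raw_posts:
        try:
            valid.append(BronzePost(**raw))
        except ValidationError:
            # Keep the bad record so it can be logged or inspected later.
            rejected.append(raw)
    return valid, rejected
```

Only the records in `valid` would move on to the S3 load. Collecting the rejected records instead of raising on the first failure is one possible design choice; a stricter pipeline might fail the task.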
- AWS CLI and Terraform for infra provisioning
- Docker for Airflow and DAG execution
Setup and initial execution are handled by the Makefile.
- `make init`: Initializes Airflow (user setup, DB migrations)
- `make infra`: Sets up the AWS infrastructure (S3, Redshift, Budget) and creates the `configuration.env` file with the secrets
- `make up`: Runs Airflow