Pipeline that scrapes data from r/india subreddit and finalizes data for the visual layer

mananapr/reddit_india_pipeline

Reddit India Pipeline

Pipeline that scrapes data from r/india subreddit and finalizes data for the visual layer.

Architecture

flowchart

  • Infra Provisioning: Terraform (with AWS)
  • Containerization: Docker
  • Orchestration: Airflow
  • Visual Layer: Metabase

DAG Tasks:

  1. Scrape data from r/india to generate bronze data
  2. Validate using Pydantic and load data to S3
  3. Generate and validate silver data and load to S3
  4. Load silver data into Redshift
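
The bronze-to-silver steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the field names (`post_id`, `title`, `score`, `created_utc`) and the helpers `validate_bronze` and `to_silver` are hypothetical, and plain standard-library dataclasses stand in for the Pydantic models the pipeline actually uses, so the snippet runs without extra dependencies. The S3 and Redshift loads are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BronzePost:
    """Raw post as scraped. Field names are assumptions for illustration."""
    post_id: str
    title: str
    score: int
    created_utc: float

    def __post_init__(self):
        # Stand-in for Pydantic validation: reject records that would
        # break the later steps.
        if not self.post_id:
            raise ValueError("post_id must not be empty")
        if not isinstance(self.score, int):
            raise ValueError("score must be an integer")
        if self.created_utc < 0:
            raise ValueError("created_utc must be a non-negative epoch timestamp")


@dataclass(frozen=True)
class SilverPost:
    """Cleaned post, shaped for loading into a warehouse table."""
    post_id: str
    title: str
    score: int
    created_date: str  # ISO 8601 date in UTC


def validate_bronze(raw_rows):
    """Split raw dicts into validated bronze records and rejected rows."""
    valid, rejected = [], []
    for row in raw_rows:
        try:
            valid.append(BronzePost(**row))
        except (TypeError, ValueError):
            # TypeError covers missing or unexpected keys and bad types.
            rejected.append(row)
    return valid, rejected


def to_silver(post: BronzePost) -> SilverPost:
    """Derive the silver record: trim the title, convert epoch to a UTC date."""
    created = datetime.fromtimestamp(post.created_utc, tz=timezone.utc)
    return SilverPost(
        post_id=post.post_id,
        title=post.title.strip(),
        score=post.score,
        created_date=created.date().isoformat(),
    )


if __name__ == "__main__":
    rows = [
        {"post_id": "abc", "title": "  Hello ", "score": 5, "created_utc": 86400.0},
        {"post_id": "", "title": "No id", "score": 1, "created_utc": 0.0},
    ]
    bronze, bad = validate_bronze(rows)
    print([to_silver(p) for p in bronze], len(bad))
```

Keeping rejected rows instead of raising on the first bad record is one possible design choice; a pipeline could equally fail the task so the problem shows up in Airflow.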

Requirements

  1. AWS CLI and Terraform for infra provisioning
  2. Docker for Airflow and DAG execution

Setup

Setup and initial execution are handled by the Makefile.

  1. make init: Initializes Airflow (User setup, DB migrations)
  2. make infra: Sets up the AWS Infrastructure (S3, Redshift, Budget) and creates the configuration.env file with the secrets
  3. make up: Runs Airflow

Dashboard

dashboard
