DATA-CLIMATE-PIPELINE


The goal of this project is to analyze the relationship between:

  • temperature
  • country

These are some of the questions that it will be possible to answer:

  • What is the Country with the highest temperature in 2003?
SELECT 
    country_name
FROM 
    climate
where 
    extract(year from date_time) = 2003
qualify row_number() over(order by avg_temperature desc) = 1
  • What is the Country with the highest increase in temperature over the last 10 years?
with temperature_avg_year as (
    SELECT 
        country_name,
        date_trunc('year', date_time) as year,
        avg(avg_temperature) as avg_temperature_year
    FROM 
        climate
    group by 1, 2
),
temperature_change as (
    SELECT 
        country_name,
        avg_temperature_year
            - lag(avg_temperature_year, 10) over(partition by country_name order by year) as change_temperature_in_10_years
    FROM 
        temperature_avg_year
)
SELECT 
    country_name,
    change_temperature_in_10_years
FROM 
    temperature_change
qualify row_number() over(order by change_temperature_in_10_years desc) = 1

Ultimately, this is the core question:

  • What is the best Country to live in the future?


The first step concerns the Data Lake, which is built on S3, so it is easy to add unstructured files to the system. For our purpose we have CSV files with a pre-defined structure, which makes them even easier to maintain. The data source files are then handled by Airflow: the workflow cleans the source data and populates the Data Warehouse, which sits on AWS Redshift. The DAG should run on a monthly basis; since the data source is updated at the beginning of each month, the workflow should run the day after.
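
For illustration, here is a minimal sketch of such a monthly DAG, assuming Airflow 2.x; the dag_id, dates and task body are placeholders, not the repository's actual code.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def clean_sources(**context):
    # Placeholder for the cleaning step described above.
    pass


# Runs at midnight on the 2nd of every month, the day after the source refresh.
with DAG(
    dag_id="data_climate_pipeline",      # assumed name
    start_date=datetime(2022, 1, 2),
    schedule_interval="0 0 2 * *",
    catchup=False,
) as dag:
    clean = PythonOperator(
        task_id="clean_sources",
        python_callable=clean_sources,
    )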

Below is the folder structure of the data lake on S3:

bucket: data-climate
│
├── temperature_country.csv
├── temperature_state.parquet/
└── country.csv

temperature_state.parquet/ is a folder that contains the partitions of the original temperature_state.parquet file. Since this file is very big, it's better to process one chunked file at a time.
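
A sketch of that chunked processing, assuming boto3 and pandas (reading s3:// paths with pandas also requires s3fs); the bucket and prefix come from the layout above, while the processing body is a placeholder:

import boto3
import pandas as pd  # reading s3:// paths also requires s3fs

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Iterate over the partition files under temperature_state.parquet/ and
# process them one at a time to keep memory usage bounded.
for page in paginator.paginate(Bucket="data-climate", Prefix="temperature_state.parquet/"):
    for obj in page.get("Contents", []):
        chunk = pd.read_parquet(f"s3://data-climate/{obj['Key']}")
        # ... clean / aggregate this chunk before loading the next one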

This step is responsible for cleaning, transforming, and loading data into the data warehouse. Before running the pipeline, make sure to create the connections to AWS from the Airflow Connections section.
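
For example, tasks can reach S3 and Redshift through hooks that reference those connections; the connection ids used below ("aws_credentials" and "redshift") are assumptions and must match whatever is configured under Admin > Connections:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Connection ids are assumed; create them in the Airflow UI first.
s3_hook = S3Hook(aws_conn_id="aws_credentials")
redshift_hook = PostgresHook(postgres_conn_id="redshift")

# List the source files in the data lake and open a cursor to the warehouse.
keys = s3_hook.list_keys(bucket_name="data-climate")
conn = redshift_hook.get_conn()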


The data warehouse is made up of the following tables:

CREATE TABLE country (
	"country_name" varchar(256),
	"country_id" varchar(256),
	PRIMARY KEY("country_id")
);


CREATE TABLE temperature_country (
	"date_time" timestamp,
	"avg_temperature" float,
	"country_name" varchar,
	PRIMARY KEY("country_name","date_time")
);

CREATE TABLE temperature_state (
	"date_time" timestamp,
	"avg_temperature" float,
	"state_name" varchar,
	"country_name" varchar,
	PRIMARY KEY("state_name","date_time")
);

CREATE TABLE climate (
	"date_time" timestamp,
	"avg_temperature" float,
	"country_name" varchar,
	"country_id" varchar,
	"max_temperature_state" float,
	"min_temperature_state" float,
	"distinct_state" int,
	PRIMARY KEY("country_id","date_time")
);
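
Loading one of these tables from the data lake could then look like the sketch below, issued from an Airflow task; the connection id and the IAM role ARN are placeholders:

from airflow.providers.postgres.hooks.postgres import PostgresHook

# Sketch of a load step: copy the country-level CSV into its table.
# The connection id and IAM role ARN are assumptions, not the project's values.
PostgresHook(postgres_conn_id="redshift").run("""
    COPY temperature_country
    FROM 's3://data-climate/temperature_country.csv'
    IAM_ROLE '<redshift-iam-role-arn>'
    CSV
    IGNOREHEADER 1;
""")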


  • What if the data increased by 100x?
    • If the data increased by 100x, I would process it with PySpark on a cluster, so each node operates on a subset of the input dataset (see the sketch after this list).
  • What if the pipeline ran on a daily basis by 7 am?
    • In that case I would add a condition to the query to slice the data by day, based on date_time.
  • What if the database needed to be accessed by 100+ people?
    • In that case I would enable auto scaling on the Redshift cluster so it can scale out and scale in to follow user demand.
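
As referenced in the first answer above, here is a minimal PySpark sketch of that distributed aggregation, assuming the S3 layout described earlier; the input and output paths are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-climate-100x").getOrCreate()

# Each executor reads and aggregates a subset of the partitioned input.
temperature = spark.read.parquet("s3://data-climate/temperature_state.parquet/")

yearly_avg = (
    temperature
    .withColumn("year", F.year("date_time"))
    .groupBy("country_name", "year")
    .agg(F.avg("avg_temperature").alias("avg_temperature_year"))
)

# Illustrative output location.
yearly_avg.write.mode("overwrite").parquet("s3://data-climate/temperature_avg_year/")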
