The goal of this project is to analyze heart data to predict hypothetical future diseases.
Here you can find the source data of this project.
The project is composed of two parts:
- Data Warehouse development
- Machine Learning pipeline
Both parts are implemented in Airflow as DAGs, so each of them consists of a sequence of tasks that accomplish a goal.
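As a rough illustration of this structure, such a DAG might look like the sketch below; the dag_id, task names and callables are purely hypothetical and not the project's actual code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task callables; the project's real tasks live in its own DAG files.
def extract():
    print("extract data")

def transform():
    print("transform data")

def load():
    print("load data")

with DAG(
    dag_id="example_sequence_dag",   # illustrative name only
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The >> operator defines the sequence: extract -> transform -> load
    extract_task >> transform_task >> load_task
```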
The input data are stored locally so that they are accessible from the Docker containers.
The transformation and loading operations are performed by the etl_dag script run on Airflow. This DAG is responsible for extracting the data (locally), transforming it, and loading it into PostgreSQL tables.
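A load step of this kind could look roughly like the following sketch; the file path, the postgres_default connection id and the surrogate-key transformation are assumptions for illustration, not the actual etl_dag code.

```python
import pandas as pd
from airflow.providers.postgres.hooks.postgres import PostgresHook

def etl_heart_data():
    # Extract: read the locally stored CSV (path is an assumption)
    df = pd.read_csv("/opt/airflow/data/heart.csv")

    # Transform: e.g. add a surrogate key for each record (illustrative)
    df.insert(0, "account_id", [f"acc_{i}" for i in range(len(df))])

    # Load: insert the rows into the warehouse table via an Airflow connection
    hook = PostgresHook(postgres_conn_id="postgres_default")
    hook.insert_rows(
        table="heart_analysis.heart_fact",
        rows=df.itertuples(index=False, name=None),
        target_fields=list(df.columns),
    )
```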
It's possible to review the PostgreSQL tables from pgAdmin.
Below there's the ETL workflow on Airflow:
The Data Warehouse of the project is stored on PostgreSQL.
Below are the schemas of heart_fact, heart_disease_dim and account_dim.
CREATE TABLE IF NOT EXISTS heart_analysis.heart_fact(
"account_id" varchar,  -- unique identifier of the account
"age" int,             -- age in years
"sex" int,             -- sex (1 = male, 0 = female)
"cp" int,              -- chest pain type
"trestbps" int,        -- resting blood pressure
"chol" int,            -- serum cholesterol
"fbs" int,             -- fasting blood sugar > 120 mg/dl
"restecg" int,         -- resting electrocardiographic results
"thalach" int,         -- maximum heart rate achieved
"exang" int,           -- exercise-induced angina
"oldpeak" float,       -- ST depression induced by exercise
"slope" int,           -- slope of the peak exercise ST segment
"ca" int,              -- number of major vessels colored by fluoroscopy
"thal" int,            -- thalassemia result
"target" int,          -- presence of heart disease
PRIMARY KEY("account_id")
);
CREATE TABLE IF NOT EXISTS heart_analysis.heart_disease_dim(
"account_id" varchar,
"cp" int,
"trestbps" int,
"chol" int,
"fbs" int,
"restecg" int,
"thalach" int,
"exang" int,
"oldpeak" float,
"slope" int,
"ca" int,
"thal" int,
"target" int,
PRIMARY KEY("account_id")
);
CREATE TABLE IF NOT EXISTS heart_analysis.account_dim(
"account_id" varchar,
"age" int,
"sex" int,
PRIMARY KEY("account_id")
);
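To show how the fact and dimension tables relate, a query joining them back on account_id could be issued from a task via Airflow's PostgresHook; the connection id and the selected columns below are assumptions, not part of the project's code.

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Hypothetical helper: join the fact table with the account dimension.
def fetch_joined_records():
    hook = PostgresHook(postgres_conn_id="postgres_default")  # assumed conn_id
    sql = """
        SELECT f.account_id, a.age, a.sex, f.chol, f.target
        FROM heart_analysis.heart_fact f
        JOIN heart_analysis.account_dim a ON a.account_id = f.account_id
        LIMIT 10;
    """
    return hook.get_records(sql)
```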
Below there's the ML pipeline on Airflow:
Create docker-compose.yaml, which is responsible for running the Airflow components, each in a different container:
- airflow-webserver
- airflow-scheduler
- airflow-worker
- airflow-triggerer
- mlflow server
- postgresql
- pgadmin
From terminal, run the following command to start Airflow on port 8080:
docker compose up -d
After the containers are running, visit the page: localhost:8080
And log into the Airflow world!
Populate the dags folder with all the DAGs needed for the project.
Before running any DAGs, establish a connection with PostgreSQL.
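Besides the Airflow UI, one possible way to create such a connection is programmatically; the conn_id, host and credentials below are placeholders matching a typical docker-compose setup and should be adapted to the actual one.

```python
from airflow import settings
from airflow.models import Connection

# Hypothetical connection to the postgres service defined in docker-compose;
# conn_id, login and password are assumptions, adjust to the real setup.
conn = Connection(
    conn_id="postgres_default",
    conn_type="postgres",
    host="postgres",        # service name of the PostgreSQL container
    schema="airflow",
    login="airflow",
    password="airflow",
    port=5432,
)

session = settings.Session()
session.add(conn)
session.commit()
```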
In docker-compose.yaml, include the mlflow container in the services section.
This container is responsible for running the MLflow server, exposed on localhost:600.
Open example_dag.py and set the URI of the current MLflow server (localhost:600):
mlflow.set_tracking_uri('http://mlflow:600')
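Inside a DAG task, experiment tracking against that server might then look roughly like this; the experiment name, parameter and metric are illustrative only.

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow:600")  # MLflow server from docker-compose
mlflow.set_experiment("heart_disease")        # hypothetical experiment name

with mlflow.start_run():
    # Illustrative values; the real pipeline would log its own params/metrics.
    mlflow.log_param("model", "logistic_regression")
    mlflow.log_metric("accuracy", 0.85)
```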
After updating the URI of the MLflow server, create a new connection on Airflow.
The Experiments section on MLflow provides a table to compare experiments:
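If needed, the same run data can also be pulled programmatically, for example with mlflow.search_runs; the experiment name below is the hypothetical one used in the earlier sketch.

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow:600")

# Returns a pandas DataFrame with one row per run (params, metrics, tags).
runs = mlflow.search_runs(experiment_names=["heart_disease"])
print(runs.head())
```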
In docker-compose.yaml, include the postgres and pgadmin containers in the services section.
First of all, access localhost:5050 to create a connection to postgres.
Then, in the Servers section it's easy to monitor and query those tables.