HEART-DISEASE-ANALYSIS

The goal of this project is to analyze heart data to predict potential future heart disease.

Here you can find the source data of this project.

The project is composed of two parts:

  • Data Warehouse development
  • Machine Learning pipeline

Both parts are implemented as Airflow DAGs; each is composed of a sequence of tasks that accomplishes a specific goal.
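
As a rough sketch, each DAG looks something like the snippet below. The dag_id and the placeholder tasks are illustrative only, not the actual definitions used in this repository.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Minimal sketch of an Airflow DAG: an ordered sequence of tasks.
# EmptyOperator (Airflow >= 2.3) is used here purely as a placeholder.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    extract >> transform >> load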

The input data are stored locally so that they are available to the Docker containers.

The transformation and loading operations are performed by the etl_dag script, which runs on Airflow.

This DAG is responsible for extracting data (locally), transforming it, and loading it into a PostgreSQL table.
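
A minimal sketch of such a load step is shown below; the CSV path, connection id and target table are assumptions, not the exact values used by etl_dag.

import pandas as pd
from airflow.providers.postgres.hooks.postgres import PostgresHook


def load_heart_data():
    # Extract: read the locally mounted CSV (path is an assumption).
    df = pd.read_csv("/opt/airflow/data/heart.csv")
    # Transform: minimal cleaning before loading.
    df = df.dropna().drop_duplicates()
    # Load: insert the rows into PostgreSQL through an Airflow connection.
    hook = PostgresHook(postgres_conn_id="postgres_default")
    hook.insert_rows(
        table="heart_analysis.heart_fact",
        rows=list(df.itertuples(index=False, name=None)),
        target_fields=list(df.columns),
    )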

It's possible to review the PostgreSQL tables from pgAdmin. Below is the ETL workflow on Airflow.

The Data Warehouse of the project is stored in PostgreSQL.

Below are the schemas of heart_fact, heart_disease_dim, and account_dim.

CREATE TABLE IF NOT EXISTS heart_analysis.heart_fact(
    "account_id" varchar,
    "age" int,
    "sex" int,
    "cp" int,
    "trestbps" int,
    "chol" int,
    "fbs" int,
    "restecg" int,
    "thalach" int,
    "exang" int,
    "oldpeak" float,
    "slope" int,
    "ca" int,
    "thal" int,
    "target" int,
    PRIMARY KEY("account_id")
);

CREATE TABLE IF NOT EXISTS heart_analysis.heart_disease_dim(
    "account_id" varchar,
    "cp" int,
    "trestbps" int,
    "chol" int,
    "fbs" int,
    "restecg" int,
    "thalach" int,
    "exang" int,
    "oldpeak" float,
    "slope" int,
    "ca" int,
    "thal" int,
    "target" int,
    PRIMARY KEY("account_id")
);

CREATE TABLE IF NOT EXISTS heart_analysis.account_dim(
    "account_id" varchar,
    "age" int,
    "sex" int,
    PRIMARY KEY("account_id")
);
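
Once the ETL DAG has run, the tables can be checked with a quick query that joins the fact table with account_dim. The host, credentials and database name below are assumptions; use the values configured in docker-compose.yaml.

import psycopg2

# Quick sanity check on the warehouse: join heart_fact with account_dim.
# Host, credentials and database name are placeholder values.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="airflow",
    user="airflow", password="airflow",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT a.account_id, a.age, a.sex, f.chol, f.target
        FROM heart_analysis.heart_fact f
        JOIN heart_analysis.account_dim a USING (account_id)
        LIMIT 10;
    """)
    for row in cur.fetchall():
        print(row)
conn.close()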

Below is the ML pipeline on Airflow.
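
As a sketch of what the training step of that pipeline could look like (the model choice, feature handling and metric are assumptions, not necessarily what the project's DAG does):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train(df: pd.DataFrame) -> float:
    # Features are every column of heart_fact except the key and the label.
    X = df.drop(columns=["account_id", "target"])
    y = df["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))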

Create the docker-compose.yaml, which is responsible for running the Airflow components, each in a separate container:

  • airflow-webserver
  • airflow-scheduler
  • airflow-worker
  • airflow-triggerer
  • mlflow server
  • postgresql
  • pgadmin

From the terminal, run the following command to start Airflow on port 8080:

docker compose up -d

After the containers are up and running, visit localhost:8080.

And log into the Airflow world!

Populate the dags folder with all the DAGs needed for the project. Before running any DAGs, establish a connection with PostgreSQL.
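
The connection is typically created from the Airflow UI (Admin > Connections). As a sketch, it can also be registered programmatically; the conn_id, host and credentials below are assumptions based on common docker-compose defaults.

from airflow.models import Connection
from airflow.utils.session import create_session

# Register a PostgreSQL connection if it does not exist yet.
# conn_id, host and credentials are placeholder values.
postgres_conn = Connection(
    conn_id="postgres_default",
    conn_type="postgres",
    host="postgres",
    schema="airflow",
    login="airflow",
    password="airflow",
    port=5432,
)

with create_session() as session:
    already_there = (
        session.query(Connection)
        .filter(Connection.conn_id == postgres_conn.conn_id)
        .first()
    )
    if not already_there:
        session.add(postgres_conn)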

The docker-compose.yaml includes the mlflow container in the services section. This container is responsible for running the MLflow server, exposed on localhost:600.

Open example_dag.py and set the URI of the current MLflow server (localhost:600):

mlflow.set_tracking_uri('http://mlflow:600')

After updating the URI of the MLflow server, create a new connection on Airflow. The experiments section on MLflow provides a table to compare experiments.
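
A minimal sketch of logging a run to that experiments table (the experiment name, parameters and metric value are placeholders):

import mlflow

# Point the client at the MLflow service defined in docker-compose.yaml.
mlflow.set_tracking_uri("http://mlflow:600")
mlflow.set_experiment("heart_disease_analysis")  # placeholder experiment name

with mlflow.start_run():
    mlflow.log_param("model", "logistic_regression")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_metric("accuracy", 0.85)  # placeholder value, not a project result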

The docker-compose.yaml also includes the postgres and pgadmin containers in the services section. First of all, access localhost:5050 to create a connection to postgres.

Then, from the Servers section, it's easy to monitor and query those tables.