The goal of this project is to analyze heart data to predict hypothetical future diseases.
Here you can find the source data of this project.
The project is composed of two parts:
- Data Warehouse development
- Machine Learning pipeline
Both parts are implemented in Airflow as DAGs, so each of them consists of a sequence of tasks that accomplish a goal.
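As a rough illustration of this structure, such a DAG might look like the sketch below; the dag_id, task names and callables are purely hypothetical and not the project's actual code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task callables; the project's real tasks live in its own DAG files.
def extract():
    print("extract data")

def transform():
    print("transform data")

def load():
    print("load data")

with DAG(
    dag_id="example_sequence_dag",   # illustrative name only
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The >> operator defines the sequence: extract -> transform -> load
    extract_task >> transform_task >> load_task
```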
The input data are stored locally so that they are accessible from the Docker containers.
The transformation and loading operations are performed by the etl_dag script run on Airflow. This DAG is responsible for extracting the data (locally), transforming it, and loading it into PostgreSQL tables.
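A load step of this kind could look roughly like the following sketch; the file path, the postgres_default connection id and the surrogate-key transformation are assumptions for illustration, not the actual etl_dag code.

```python
import pandas as pd
from airflow.providers.postgres.hooks.postgres import PostgresHook

def etl_heart_data():
    # Extract: read the locally stored CSV (path is an assumption)
    df = pd.read_csv("/opt/airflow/data/heart.csv")

    # Transform: e.g. add a surrogate key for each record (illustrative)
    df.insert(0, "account_id", [f"acc_{i}" for i in range(len(df))])

    # Load: insert the rows into the warehouse table via an Airflow connection
    hook = PostgresHook(postgres_conn_id="postgres_default")
    hook.insert_rows(
        table="heart_analysis.heart_fact",
        rows=df.itertuples(index=False, name=None),
        target_fields=list(df.columns),
    )
```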
It's possible to review the PostgreSQL tables from pgAdmin.
Below there's the ETL workflow on Airflow:
The Data Warehouse of the project is stored on PostgreSQL.
Below are the schemas of heart_fact, heart_disease_dim and account_dim.
CREATE TABLE IF NOT EXISTS heart_analysis.heart_fact(
"account_id" varchar,  -- unique identifier of the account
"age" int,             -- age in years
"sex" int,             -- sex (1 = male, 0 = female)
"cp" int,              -- chest pain type
"trestbps" int,        -- resting blood pressure
"chol" int,            -- serum cholesterol
"fbs" int,             -- fasting blood sugar > 120 mg/dl
"restecg" int,         -- resting electrocardiographic results
"thalach" int,         -- maximum heart rate achieved
"exang" int,           -- exercise-induced angina
"oldpeak" float,       -- ST depression induced by exercise
"slope" int,           -- slope of the peak exercise ST segment
"ca" int,              -- number of major vessels colored by fluoroscopy
"thal" int,            -- thalassemia result
"target" int,          -- presence of heart disease
PRIMARY KEY("account_id")
);
CREATE TABLE IF NOT EXISTS heart_analysis.heart_disease_dim(
"account_id" varchar,
"cp" int,
"trestbps" int,
"chol" int,
"fbs" int,
"restecg" int,
"thalach" int,
"exang" int,
"oldpeak" float,
"slope" int,
"ca" int,
"thal" int,
"target" int,
PRIMARY KEY("account_id")
);
CREATE TABLE IF NOT EXISTS heart_analysis.account_dim(
"account_id" varchar,
"age" int,
"sex" int,
PRIMARY KEY("account_id")
);
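To show how the fact and dimension tables relate, a query joining them back on account_id could be issued from a task via Airflow's PostgresHook; the connection id and the selected columns below are assumptions, not part of the project's code.

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Hypothetical helper: join the fact table with the account dimension.
def fetch_joined_records():
    hook = PostgresHook(postgres_conn_id="postgres_default")  # assumed conn_id
    sql = """
        SELECT f.account_id, a.age, a.sex, f.chol, f.target
        FROM heart_analysis.heart_fact f
        JOIN heart_analysis.account_dim a ON a.account_id = f.account_id
        LIMIT 10;
    """
    return hook.get_records(sql)
```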
Below there's the ML pipeline on Airflow:
Create docker-compose.yaml, which is responsible for running the Airflow components, each in a different container:
- airflow-webserver
- airflow-scheduler
- airflow-worker
- airflow-triggerer
- mlflow server
- postgresql
- pgadmin
From terminal, run the following command to start Airflow on port 8080:
docker compose up -d
After the containers are running, visit the page: localhost:8080
And log into the Airflow world!
Populate the dags folder with all the DAGs needed for the project.
Before running any DAGs, establish a connection with PostgreSQL.
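Besides the Airflow UI, one possible way to create such a connection is programmatically; the conn_id, host and credentials below are placeholders matching a typical docker-compose setup and should be adapted to the actual one.

```python
from airflow import settings
from airflow.models import Connection

# Hypothetical connection to the postgres service defined in docker-compose;
# conn_id, login and password are assumptions, adjust to the real setup.
conn = Connection(
    conn_id="postgres_default",
    conn_type="postgres",
    host="postgres",        # service name of the PostgreSQL container
    schema="airflow",
    login="airflow",
    password="airflow",
    port=5432,
)

session = settings.Session()
session.add(conn)
session.commit()
```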
In docker-compose.yaml, include the mlflow container in the services section.
This container is responsible for running the MLflow server, exposed on localhost:600.
Open example_dag.py and set the URI of the current MLflow server (localhost:600):
mlflow.set_tracking_uri('http://mlflow:600')
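Inside a DAG task, experiment tracking against that server might then look roughly like this; the experiment name, parameter and metric are illustrative only.

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow:600")  # MLflow server from docker-compose
mlflow.set_experiment("heart_disease")        # hypothetical experiment name

with mlflow.start_run():
    # Illustrative values; the real pipeline would log its own params/metrics.
    mlflow.log_param("model", "logistic_regression")
    mlflow.log_metric("accuracy", 0.85)
```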
After updating the URI of the MLflow server, create a new connection on Airflow.
The Experiments section on MLflow provides a table to compare experiments:
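If needed, the same run data can also be pulled programmatically, for example with mlflow.search_runs; the experiment name below is the hypothetical one used in the earlier sketch.

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow:600")

# Returns a pandas DataFrame with one row per run (params, metrics, tags).
runs = mlflow.search_runs(experiment_names=["heart_disease"])
print(runs.head())
```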
In docker-compose.yaml, include the postgres and pgadmin containers in the services section.
First of all, access localhost:5050 to create a connection to postgres.
Then, in the Servers section it's easy to monitor and query those tables.