An end-to-end analytics platform that models and visualizes customer churn. The project blends concepts from data engineering, analytics engineering, and MLOps. (Please refer to the mlops branch for the latest changes.)
This project was developed to:
- Employ various elements of the modern data stack
- Be applied to a somewhat realistic business use case
- Serve as a template for others to learn from and use via extensive documentation
- Build model retraining pipelines with experiment tracking using MLflow
- Automate model deployment using the MLflow Model Registry
- Add monitoring using Evidently and Grafana
- Implement unit and integration tests for Airflow functions.
- Develop SCD2 functionality for dimension tables.
- Evaluate the feasibility of incremental refresh of dbt models to reduce overhead.
For this project, the Telco Customer Churn data module, a sample dataset from IBM's Cognos Analytics platform, was selected; given how difficult it is to find a decent dataset for this use case, it seemed like the best fit.
The dataset is then used to train two models. The first is a Gaussian copula synthesizer that produces synthetic data with characteristics similar to the original; since there is not much data to go around, this serves as a rudimentary imitation of data entering the database. The second is the product of running FLAML's AutoML implementation on the data, and its purpose is to predict churn status for a particular customer. These experiments are tracked using MLflow for better observability.
The synthesizer model is hosted via FastAPI, while the classifier is deployed through the MLflow Model Registry, which dynamically exposes the latest "Champion" model for inference. Airflow is then used to orchestrate pulling data from the synthesizer, obtaining churn predictions for that data from the classification model, generating a ULID for each customer, and writing it all to a Postgres database. Airflow also triggers dbt afterward to run tests and apply the necessary transformations. The data is modeled as a star schema and is finally visualized as a dashboard using Metabase. The various model metrics are also stored and monitored with Grafana to proactively deal with any problems.
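As a concrete illustration of the registry pattern, here is a minimal sketch of how a consumer can resolve whichever model version currently holds the "Champion" alias. The registered model name, tracking URI, and input file are assumptions for illustration, not the repo's actual values.

```python
import mlflow
import pandas as pd

# Tracking/registry server URI is an assumption for this sketch.
mlflow.set_tracking_uri("http://mlflow:5000")

# Resolve whichever registered version currently carries the "Champion" alias.
# Promoting a different version in the registry changes what gets loaded here,
# with no code changes on the consuming side.
model = mlflow.pyfunc.load_model("models:/churn_classifier@Champion")

# Hypothetical batch of customer features with the same columns used in training.
features = pd.read_csv("new_customers.csv")
predictions = model.predict(features)
```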
Almost all of the services above run in their own Docker containers, as seen in the diagram. These containers run on a GCP VM provisioned via Terraform. Finally, GitHub Actions facilitates CI/CD: the code is quickly checked for issues using Ruff, and changes made to this repo are reflected on the VM.
- Keep in mind that predicting customer churn is a rather sophisticated use case and is difficult to get right. AutoML was only employed to obtain a viable baseline model, which is how it should be applied most of the time. Check out Fighting Churn with Data for a more comprehensive treatment of modelling customer churn.
- The original dataset has been transformed and can be found here. This is the dataset the models have been trained on.
- As mentioned earlier, an effort has been made to document the various elements that comprise this project. Most files have accompanying comments to facilitate understanding, and higher-level system design justifications can be found in the README below.
If you want to run the pipeline on your own, here are some considerations:
- The Docker Compose file is the crux of this project, so that is a good starting point.
- Rename sample.env to .env
- Double-check the ports for the various Docker services if the project is run locally, and ensure they don't conflict with ports used by other services on the system.
- Add the paths to your credentials file and locally generated SSH key files in the Terraform main file (main.tf).
- Adjust the variables in variables.tf to your liking.
- If the CD implementation is going to be used, change the output filepath (if necessary) and add the required GitHub secrets.
- Beware of port forwarding if Terraform is being used to create the VM that the project runs on. If the instance's external IP address is known, anyone can gain access to the running Docker containers.
- Used to quickly obtain a viable baseline
- The first framework option was H2O.ai, as it seemed like a mature offering. However, it was difficult to get it running locally.
- Used Microsoft's FLAML instead because it had streamlined access to core AutoML use cases.
- However, it can be too minimalistic for some.
- Ran for 3 hours; the final model was an LGBM (LightGBM) estimator.
- Used predict_proba as a crude proxy for confidence; it would benefit from calibration for a more reliable measure (see the sketch below).
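A minimal sketch of how FLAML can be pointed at the dataset with the run tracked in MLflow; the file path, column names, and experiment name are assumptions, and the real training script in the repo may differ.

```python
import mlflow
import pandas as pd
from flaml import AutoML
from sklearn.model_selection import train_test_split

# Hypothetical path and column names for the transformed Telco churn data.
df = pd.read_csv("data/telco_churn.csv")
X, y = df.drop(columns=["Churn"]), df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

mlflow.set_experiment("churn_automl")  # experiment name is an assumption
with mlflow.start_run():
    automl = AutoML()
    automl.fit(
        X_train,
        y_train,
        task="classification",
        metric="roc_auc",
        time_budget=3 * 60 * 60,  # roughly the 3-hour budget mentioned above
    )
    mlflow.log_metric("best_val_loss", automl.best_loss)
    mlflow.sklearn.log_model(automl.model.estimator, "model")

# Uncalibrated probability of the positive class as a crude confidence proxy.
churn_probability = automl.predict_proba(X_test)[:, 1]
```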
- Went for this design to obtain something akin to real-time data
- Went with a Gaussian copula model from the SDV (Synthetic Data Vault) library
- Initially wanted to go with the popular CTGAN model from this paper, but the resulting model was an order of magnitude bigger and performed about the same as the Gaussian copula model
- Size and speed of inference are important since the model is going into production
- Constraints are added so that location data is generated consistently (i.e. you don't get a random city associated with a random zip code); see the sketch below
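A minimal sketch of the synthesizer setup, assuming SDV's single-table API; the file path and the city/zip column names are placeholders for whatever location columns the dataset actually uses.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Hypothetical path to the training data; column names below are assumptions.
real_df = pd.read_csv("data/telco_churn.csv")

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

synthesizer = GaussianCopulaSynthesizer(metadata)

# Only combinations of city and zip code observed in the real data are
# generated together, so location fields stay internally consistent.
synthesizer.add_constraints(constraints=[
    {
        "constraint_class": "FixedCombinations",
        "constraint_parameters": {"column_names": ["City", "Zip Code"]},
    }
])

synthesizer.fit(real_df)
synthetic_df = synthesizer.sample(num_rows=500)
```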
- Stands for Universally Unique Lexicographically Sortable IDentifier; used as the primary key to uniquely identify each customer.
- Binary-compatible with the UUID (Universally Unique IDentifier) format, thus inheriting its advantages over a standard auto-incrementing integer primary key, namely robust uniqueness and security.
- In addition, since a ULID is generated client-side (as opposed to an autoincremented value), there is no need to go back to the primary source table to fetch the ID before writing to other tables.
- ULID also overcomes one of the major drawbacks associated with using UUIDs, which is the lack of sortability. This holds at least until newer UUID versions that solve this problem (like v6 and v7) are standardized and widely supported.
- Generated using the python-ulid library and converted to UUID format before being inserted into Postgres (see the sketch below).
- One drawback is how the relatively large size of ULIDs (and UUIDs in general) can affect performance, but that remains to be seen.
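A minimal sketch of the ID flow described above, using the python-ulid package; the printed values are illustrative.

```python
from ulid import ULID  # from the python-ulid package

new_id = ULID()            # encodes a millisecond timestamp plus randomness
print(str(new_id))         # 26-character, lexicographically sortable string
print(new_id.to_uuid())    # 128-bit UUID representation for Postgres' uuid column type
print(new_id.datetime)     # the creation time can be recovered from the ID itself
```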
- Went with the tried and tested star schema (Kimball approach) for this project
- Briefly considered the more recent One Big Table (OBT) approach, which would reduce the number of data transformations.
- However, there isn't anywhere near the amount of data needed to see any real performance gains, and it felt like falling into the trap of premature optimization.
- Plus, the star schema better facilitates data integrity and ease of understanding.
- Fast lint check using Ruff
- Uses a GitHub Action from appleboy (ssh-action) that SSHs into the VM and runs a git pull --rebase of this repo.
- Triggered on push to repo.
- Testing is applied via dbt only to the data at the beginning of the transformation layer.
- Only applied at the source since the subsequent transformations don't affect the conditions checked by the tests.
- Plus, having redundant tests just increases performance overhead.
- Defines services for Postgres, pgAdmin, Airflow, FastAPI server, and Metabase.
- Persistent volumes mapped to services for backup purposes.
- Uses an env file for environment variables available throughout.
- Data synthesis and prediction models handled via one server.
- This is done to reduce overhead in setting up and maintaining multiple servers for services.
- Serves as an example of both exposing a model for inference and pulling synthetic data from a model (see the sketch below).
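A compressed sketch of what such a dual-purpose server can look like; the artifact paths, endpoint names, and loading mechanism are assumptions (in this project the classifier is actually resolved through the MLflow Model Registry).

```python
import json

import joblib
import pandas as pd
from fastapi import FastAPI
from sdv.single_table import GaussianCopulaSynthesizer

app = FastAPI(title="churn-model-server")

# Hypothetical artifact locations for the sake of a self-contained example.
synthesizer = GaussianCopulaSynthesizer.load("artifacts/gaussian_copula.pkl")
classifier = joblib.load("artifacts/churn_classifier.joblib")


@app.get("/synthesize")
def synthesize(num_rows: int = 100):
    """Pull a batch of synthetic customer rows from the copula model."""
    synthetic = synthesizer.sample(num_rows=num_rows)
    # Round-trip through JSON so numpy dtypes become plain Python values.
    return json.loads(synthetic.to_json(orient="records"))


@app.post("/predict")
def predict(records: list[dict]):
    """Return churn probabilities for the posted customer records."""
    features = pd.DataFrame(records)
    probabilities = classifier.predict_proba(features)[:, 1]
    return {"churn_probability": probabilities.tolist()}
```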
- Schema and tables created on startup.
- pgAdmin included as a GUI; DBeaver can also be considered.
- ULID primary keys are stored using the native UUID datatype.
- Also hosts Metabase metadata.
- Tasks are parallelized for efficiency.
- Uses Jinja templating to pull XCom data into the Postgres database (see the DAG sketch below).
- dbt Core is used in the Airflow container; `dbt build` is triggered via a `BashOperator`.
- This implementation can be dangerous if there is the potential for resource or environment clashes between the two services, so be wary. It would be wise to use tools like `pyenv`, `poetry`, or the more modern `uv` by Astral to sidestep this particular problem.
- If you have the Kubernetes infrastructure handy, you can also do the same with the `KubernetesPodOperator` for containerization and scalability.
- Difficult to integrate dbt Cloud with Airflow unless using something like Astronomer's Astro CLI.
- At a production level, the de facto method of using dbt is via:
  - dockerizing the dbt project
  - hosting the image on an image registry
  - having Airflow trigger a job from the most recent image
  - updating the image in the registry on every GitHub push
- The aforementioned methods were not pursued in the interest of keeping the project (relatively) straightforward, but they can be considered.
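To tie the orchestration notes above together, here is a compressed DAG sketch showing a Jinja-templated XCom pull into Postgres followed by a `dbt build` via `BashOperator`. All IDs, connection names, table names, and paths are hypothetical, and the upstream extraction/scoring tasks are omitted for brevity.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="churn_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # An upstream task (omitted here) is assumed to push a pre-formatted
    # VALUES string to XCom; the templated SQL pulls it back in at render time.
    load_predictions = SQLExecuteQueryOperator(
        task_id="load_predictions",
        conn_id="postgres_default",
        sql="""
            INSERT INTO raw.churn_predictions (customer_id, churn_probability)
            VALUES {{ ti.xcom_pull(task_ids='score_customers') }};
        """,
    )

    # Run dbt inside the Airflow container; see the environment-clash caveats above.
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/airflow/dbt && dbt build",
    )

    load_predictions >> dbt_build
```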
- Acts as visualization layer
- Selected because it is quick to set up and makes it easy to put together a decent result.
- Dashboards developed can be shared publicly and are therefore highly accessible.
- However, the link does expose the system's external IP, so that is something to be kept in mind.
- Spin up and tear down VMs quickly and reliably for rapid experimentation.
- Use output values for convenience and SSH capabilities.
- VM instance type determined via trial and error.
- Once again, be wary of port forwarding and general firewall settings.