Databricks MLOps Stacks

NOTE: This repository is based on https://github.com/databricks/mlops-stacks
NOTE: This feature is in public preview.

This repo provides a customizable stack for starting new ML projects on Databricks that follow production best practices out of the box.

Using Databricks MLOps Stacks, data scientists can quickly get started iterating on ML code for new projects while ops engineers set up CI/CD and ML assets management, with an easy transition to production. You can also use MLOps Stacks as a building block in automation for creating new data science projects with production-grade CI/CD pre-configured.


Process

An ML solution comprises data, code, and models. These assets need to be developed, validated (staging), and deployed (production). In this repository, we use the notion of dev, staging, and prod to represent the execution environments of each stage.

An instantiated project from MLOps Stacks contains an ML pipeline with CI/CD workflows to test and deploy automated model training and batch inference jobs across your dev, staging, and prod Databricks workspaces.

Data scientists can iterate on ML code and file pull requests (PRs). This will trigger unit tests and integration tests in an isolated staging Databricks workspace. Model training and batch inference jobs in staging will immediately update to run the latest code when a PR is merged into main. After merging a PR into main, you can cut a new release branch as part of your regularly scheduled release process to promote ML code changes to production.
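
Each environment corresponds to a Databricks Asset Bundle target, and the CI/CD workflows drive deployment through the Databricks CLI. As a rough sketch of what the pipelines run under the hood (the target names dev, staging, and prod and the resource key model_training_job are assumptions based on the stack's defaults; check the generated resources/*.yml files for the actual keys):

    # Validate the bundle configuration before deploying
    databricks bundle validate -t dev

    # Deploy the bundled ML assets (jobs, experiments, models) to a target workspace
    databricks bundle deploy -t staging

    # Trigger the deployed model training job by its resource key
    databricks bundle run -t staging model_training_job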


Step by Step

  1. Modify code in the dev branch
  2. Commit and push the changes to the remote repository
  3. Open a PR from dev into main (see the sketch after this list)
    • Assets are deployed to the TEST environment
    • Unit and integration tests are executed
  4. Wait for the tests to complete and approve the PR
    • Assets are deployed to the STAGING environment
  5. Open a PR from main into release
  6. Approve the PR
    • Assets are deployed to PROD
  7. Wait for the assets to be deployed
  8. Execute the jobs in PROD
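
A minimal command-line sketch of the same flow, assuming the dev and release branches already exist and that the GitHub CLI (gh) is available (PRs can just as well be opened from the GitHub web UI):

    # Steps 1-2: work on the dev branch and push the changes
    git checkout dev
    git add .
    git commit -m "Update ML pipeline code"
    git push origin dev

    # Step 3: open a PR from dev into main (triggers tests and the TEST deployment)
    gh pr create --base main --head dev --title "Promote dev to main" --body ""

    # Step 5: after merging, open a PR from main into release to promote to PROD
    gh pr create --base release --head main --title "Promote main to release" --body ""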

Set up

  1. Install Python from https://www.anaconda.com (3.8+ / tested on 3.9.12)

  2. Set up the Databricks CLI (v0.211.0+ / tested on v0.212.0; see the authentication sketch after this list)

    • Install
    brew tap databricks/tap
    brew install databricks
    
  3. Set up your IDE of choice

  4. Set up the MLOps Stacks project

    • Initialize the project
    databricks bundle init mlops-stacks
    
    • Follow the on-screen prompts
  5. Set up the GitHub repository

    git init
    git remote add origin <url>
    git config user.name <user.name>
    git config user.email <user.email>
    # git add * does not pick up hidden paths, so add .github/* explicitly
    git add *
    git add .github/*
    git commit -m init
    git push origin main
    # create the dev branch used for day-to-day work
    git checkout -b dev
    
  6. Set up the Inference Input table

    • Follow the steps in ./deployment/batch_inference/README.md
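
After installing the CLI (step 2) and initializing the project (step 4), it is worth confirming that the CLI can reach your workspaces and that the generated bundle is valid. A minimal sketch, assuming OAuth login and a placeholder workspace URL:

    # Check the installed CLI version (v0.211.0+ is required for bundles)
    databricks --version

    # Authenticate against a workspace (repeat per dev/staging/prod workspace as needed)
    databricks auth login --host https://<your-workspace-url>

    # From the project root, validate the generated bundle configuration
    databricks bundle validate -t dev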

Customizations

Compared to the upstream databricks/mlops-stacks template, this repo includes the following customizations:

  • Compute definitions (all-purpose cluster, cluster policy)
  • Job schedules set to paused by default
  • Catalog and schema variables (see the sketch below)
  • Comments disabled on databricks-mlops-stacks-bundle-ci.yml
  • Trigger conditions added to the CI pipeline
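
As an illustrative sketch of how the catalog and schema variables can be supplied at deploy time (the variable names are assumptions based on the list above; the --var flag and BUNDLE_VAR_* environment variables require a recent Databricks CLI version):

    # Override bundle variables on the command line for a given target
    databricks bundle deploy -t dev --var="catalog=dev_catalog" --var="schema=my_schema"

    # Equivalent, using environment variables read by the bundle
    BUNDLE_VAR_catalog=dev_catalog BUNDLE_VAR_schema=my_schema databricks bundle deploy -t dev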
