NOTE: This repository is based on https://github.com/databricks/mlops-stacks
NOTE: This feature is in public preview.
This repo provides a customizable stack for starting new ML projects on Databricks that follow production best practices out of the box.
Using Databricks MLOps Stacks, data scientists can quickly get started iterating on ML code for new projects while ops engineers set up CI/CD and ML asset management, with an easy transition to production. You can also use MLOps Stacks as a building block in automation for creating new data science projects with production-grade CI/CD pre-configured.
An ML solution comprises data, code, and models. These assets need to be developed, validated (staging), and deployed (production). In this repository, we use the notion of dev, staging, and prod to represent the execution environments of each stage.
An instantiated project from MLOps Stacks contains an ML pipeline with CI/CD workflows to test and deploy automated model training and batch inference jobs across your dev, staging, and prod Databricks workspaces.
Data scientists can iterate on ML code and file pull requests (PRs), which trigger unit and integration tests in an isolated staging Databricks workspace. Model training and batch inference jobs in staging are updated to run the latest code as soon as a PR is merged into main. After merging a PR into main, you can cut a new release branch as part of your regularly scheduled release process to promote ML code changes to production (see the sketch after the workflow steps below).
- Modify code in the `dev` branch
- Commit changes to the remote repository
- Open a PR from `dev` into `main`
  - Assets are deployed to the TEST environment
  - Unit and integration tests are executed
- Wait for tests to complete and approve the PR
  - Assets are deployed to the STAGING environment
- Open a PR from `main` into `release`
- Approve the PR
  - Assets are deployed to the PROD environment
- Wait for assets to be deployed
- Execute jobs in PROD
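For example, cutting a release branch from the latest `main` typically looks like the following (a minimal sketch; the branch name `release` matches the workflow above):

```sh
# Cut a release branch from the latest main to promote ML code to production
git checkout main
git pull origin main
git checkout -b release
git push origin release
```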
- Install Python from https://www.anaconda.com (3.8+ / tested on 3.9.12)
- Set up the Databricks CLI (v0.211.0+ / tested on v0.212.0)
  - Install with Homebrew:
    ```sh
    brew tap databricks/tap
    brew install databricks
    ```
- For VS Code:
  - Install from https://code.visualstudio.com/download
  - Install the Python extension from https://marketplace.visualstudio.com/items?itemName=ms-python.python
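Once the Databricks CLI is installed, a quick sanity check and workspace authentication might look like this (a minimal sketch; `databricks configure` prompts for a workspace URL and a personal access token):

```sh
# Verify the installed CLI version (should be v0.211.0 or newer)
databricks --version

# Configure authentication against your workspace (prompts for host and PAT)
databricks configure
```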
- Create a directory for your project
- Create a new Pipenv environment from the project directory:
  ```sh
  pipenv --python <version>
  ```
- Select the project's Python interpreter
  - For VS Code: open the Command Palette (Ctrl/Cmd+Shift+P) and run `Python: Select Interpreter`
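A concrete example of the Pipenv step above, using Python 3.9 (the tested version):

```sh
# Create a Pipenv environment pinned to Python 3.9, then open a shell inside it
pipenv --python 3.9
pipenv shell
```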
- Set up the MLOps Stacks project
  - Initialize the project:
    ```sh
    databricks bundle init mlops-stacks
    ```
  - Follow the on-screen instructions
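After initialization, you can sanity-check the generated bundle against your dev workspace. A minimal sketch, assuming the project directory name chosen during init and the default `dev` target in the generated bundle configuration:

```sh
# Validate the bundle configuration and deploy its assets to the dev target
cd <project_name>
databricks bundle validate -t dev
databricks bundle deploy -t dev
```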
- Set up the GitHub repository
  - Create a new remote repository
  - Install Git from https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
  - Initialize your local repository from the project directory:
    ```sh
    git init
    git remote add origin <url>
    git config user.name <user.name>
    git config user.email <user.email>
    git add *
    git add .github/*
    git commit -m init
    git push origin main
    git checkout -b dev
    ```
  - Generate Databricks PATs for the STAGING and PROD environments
  - In the GitHub repository, navigate to Settings > Secrets and variables > Actions and set up the following secrets:
    - `STAGING_WORKSPACE_TOKEN`
    - `PROD_WORKSPACE_TOKEN`
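If you prefer the command line over the GitHub UI, the same secrets can be set with the GitHub CLI (a sketch, assuming `gh` is installed and authenticated; each command prompts for the PAT value):

```sh
# Store the Databricks PATs as GitHub Actions secrets
gh secret set STAGING_WORKSPACE_TOKEN
gh secret set PROD_WORKSPACE_TOKEN
```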
- Set up the inference input table
  - Follow the steps in `./deployment/batch_inference/README.md`
Customizations made on top of the default MLOps Stacks template:

- Compute definitions (all-purpose cluster, cluster policy)
- Schedules set to paused
- Catalog and schema variables
- Disabled comments on `databricks-mlops-stacks-bundle-ci.yml`
- Added trigger conditions to the CI pipeline