This project shows how to realize MLOps in Git/GitHub. To achieve this, it heavily leverages tools such as DVC, DVC Studio, and DVCLive (all products built by iterative.ai), along with Google Drive, Jarvislabs.ai, and HuggingFace Hub.
- Click "Use this template" button to create your own repository
- Wait for few seconds, then
Initial Setup
PR will be automatically created - Merge the PR, and you are good to go
- Run `pip install -r requirements.txt` ([requirements.txt](requirements.txt))
- Run `dvc init` to enable DVC
- Add your data under the `data` directory
- Run `git rm -r --cached 'data' && git commit -m "stop tracking data"` (so that the data is tracked by DVC instead of Git)
- Run `dvc add [ADDED FILE OR DIRECTORY]` to track your data with DVC
- Run `dvc remote add -d gdrive_storage gdrive://[ID of specific folder in gdrive]` to add Google Drive as the remote data storage
- Run `dvc push`; a URL for authentication will be printed. Copy and paste it into your browser and authenticate
- Copy the content of `.dvc/tmp/gdrive-user-credentials.json` and put it in a GitHub Secret named `GDRIVE_CREDENTIAL`
- Run `git add . && git commit -m "initial commit" && git push origin main` to keep the initial setup
- Write your own pipeline under the `pipeline` directory. Code for a basic image classification task in TensorFlow is provided initially.
- Run the following `dvc stage add` command for the training stage
```bash
# if you want to use Iterative Studio / DVCLive for tracking training progress
$ dvc stage add -n train \
      -p train.train_size,train.batch_size,train.epoch,train.lr \
      -d pipeline/modeling.py -d pipeline/train.py -d data \
      --plots-no-cache dvclive/scalars/train/loss.tsv \
      --plots-no-cache dvclive/scalars/train/sparse_categorical_accuracy.tsv \
      --plots-no-cache dvclive/scalars/eval/loss.tsv \
      --plots-no-cache dvclive/scalars/eval/sparse_categorical_accuracy.tsv \
      -o outputs/model \
      python pipeline/train.py outputs/model

# if you want to use W&B for tracking training progress
$ dvc stage add -n train \
      -p train.train_size,train.batch_size,train.epoch,train.lr \
      -d pipeline/modeling.py -d pipeline/train_wandb.py -d data \
      -o outputs/model \
      python pipeline/train_wandb.py outputs/model
```
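The `dvclive/scalars/*.tsv` files registered with `--plots-no-cache` are produced by DVCLive's Keras callback inside the training script. The following is a minimal, self-contained sketch of that wiring, using a toy MNIST model rather than the code in `pipeline/`, and assuming an older dvclive 0.x release whose Keras callback is named `DvcLiveCallback` (newer releases expose `DVCLiveCallback` and use a different output layout):

```python
import tensorflow as tf
from dvclive.keras import DvcLiveCallback  # DVCLiveCallback in newer dvclive releases

# Toy stand-ins for the real model/data built in pipeline/modeling.py and pipeline/train.py.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)

(x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()

# The callback records per-epoch loss and sparse_categorical_accuracy for the
# training and validation runs as TSV files under dvclive/, which the stage
# definition above registers as DVC plots.
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=2,
    callbacks=[DvcLiveCallback()],
)
```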
- Run the following `dvc stage add` command for the evaluate stage
```bash
# if you want to use Iterative Studio / DVCLive for tracking training progress
$ dvc stage add -n evaluate \
      -p evaluate.test,evaluate.batch_size \
      -d pipeline/evaluate.py -d data/test -d outputs/model \
      -M outputs/metrics.json \
      python pipeline/evaluate.py outputs/model

# if you want to use W&B for tracking training progress
$ dvc stage add -n evaluate \
      -p evaluate.test,evaluate.batch_size \
      -d pipeline/evaluate.py -d data/test -d outputs/model \
      python pipeline/evaluate.py outputs/model
```
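The `-p` flags in the two stages above assume `params.yaml` exposes `train.*` and `evaluate.*` keys, and `-M outputs/metrics.json` assumes `pipeline/evaluate.py` finishes by writing a small JSON metrics file. A rough sketch under those assumptions (the values and the placeholder evaluation are illustrative, not the repo's actual code):

```python
import json
import pathlib

import yaml

# params.yaml is expected to look roughly like this (values are illustrative):
# train:
#   train_size: 0.8
#   batch_size: 32
#   epoch: 10
#   lr: 0.001
# evaluate:
#   test: data/test
#   batch_size: 32
with open("params.yaml") as f:
    params = yaml.safe_load(f)
eval_cfg = params["evaluate"]

# ... load outputs/model and evaluate it on eval_cfg["test"] here ...
loss, accuracy = 0.35, 0.91  # placeholder numbers for the sketch

# Because the stage declares -M outputs/metrics.json, DVC tracks this file as
# plain-text metrics, so `dvc metrics show` / `dvc metrics diff` can compare runs.
pathlib.Path("outputs").mkdir(exist_ok=True)
with open("outputs/metrics.json", "w") as f:
    json.dump({"loss": loss, "sparse_categorical_accuracy": accuracy}, f, indent=2)
```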
- Update `params.yaml` as you need
- Run `git add . && git commit -m "add initial pipeline setup" && git push origin main`
- Run `dvc repro` to run the pipeline initially
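The next two steps assume a compressed copy of the trained model exists at `outputs/model.tar.gz`. If your pipeline does not already produce it, one way to create the archive from the SavedModel directory is sketched below (an assumption about packaging, not necessarily how this repo does it; `tar -czf` from the shell works just as well):

```python
import tarfile

# Package the SavedModel directory written by the train stage (outputs/model)
# into the archive that dvc add / dvc push and the HF deployment step expect.
with tarfile.open("outputs/model.tar.gz", "w:gz") as tar:
    tar.add("outputs/model", arcname="model")
```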
- Run `dvc add outputs/model.tar.gz` to add the compressed version of the model
- Run `dvc push outputs/model.tar.gz`
- Run `echo "/pipeline/__pycache__" >> .gitignore` to ignore an unnecessary directory
- Run `git add . && git commit -m "add initial pipeline run" && git push origin main`
- Add the access token and user email of JarvisLabs.ai to GitHub Secret as `JARVISLABS_ACCESS_TOKEN` and `JARVISLABS_USER_EMAIL`
- Add a GitHub access token to GitHub Secret as `GH_ACCESS_TOKEN`
- Create a PR and write `#train --with dvc` as a comment (you have to be the owner of the repo)
- Add W&B's project name to GitHub Secret as `WANDB_PROJECT`
- Add W&B's API key to GitHub Secret as `WANDB_API_KEY`
- Use `#train --with wandb` instead of `#train --with dvc`
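For reference, the W&B route relies on `pipeline/train_wandb.py` logging through the `wandb` Keras callback; `WANDB_PROJECT` and `WANDB_API_KEY` are read from the environment, which is why they are stored as GitHub Secrets. A minimal sketch with a toy model (the actual script in `pipeline/` may differ):

```python
import tensorflow as tf
import wandb
from wandb.keras import WandbCallback

# The project name and API key are picked up from the WANDB_PROJECT /
# WANDB_API_KEY environment variables injected from the GitHub Secrets above.
wandb.init()

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)

(x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=2,
    callbacks=[WandbCallback()],  # streams per-epoch metrics to the W&B project
)
```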
- Add the access token of HuggingFace to GitHub Secret as `HF_AT`
- Add the username of HuggingFace to GitHub Secret as `HF_USER_ID`
- Write `#deploy-hf` as a comment in the PR you want to deploy to HuggingFace Space
  - GitHub Action assumes your model is archived as `model.tar.gz` under the `outputs` directory
  - Also, GitHub Action assumes your HuggingFace Space app is written in Gradio under the `hf-space` directory. You need to change `app_template.py` as you need (you shouldn't remove any environment variables in the file).
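For orientation, a Gradio Space app in this setup is a small script that loads the unpacked model and wraps it in an interface. The sketch below is hypothetical: the environment variable name `MODEL_PATH` and the label mapping are placeholders, while the real `hf-space/app_template.py` defines its own environment variables, which must be kept in place:

```python
import os

import gradio as gr
import numpy as np
import tensorflow as tf

# Hypothetical variable name for this sketch; the real app_template.py defines
# its own env variables, which the GitHub Action fills in and which should not
# be removed.
MODEL_PATH = os.environ.get("MODEL_PATH", "model")

model = tf.keras.models.load_model(MODEL_PATH)

def predict(image):
    # The image arrives as a numpy array from the Gradio Image component.
    probs = model.predict(np.expand_dims(image, axis=0))[0]
    return {str(i): float(p) for i, p in enumerate(probs)}

gr.Interface(fn=predict, inputs=gr.Image(), outputs=gr.Label()).launch()
```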
- Write solid steps to reproduce this repo for other tasks
- Support W&B for tracking the training process instead of DVCLive
- Deploy experimental model to HF Space
- Deploy current model to GKE with auto TFServing deployment project
- Add more cloud providers offering GPU VMs
- Integrate more managed services for management
- W&B Artifact for dataset/model versioning and experiment tracking
- HuggingFace for dataset/model versioning
- Integrate more managed services for deployment
- Add more example codebase (pipeline)
- TensorFlow based Object Detection
- PyTorch based Image Classification
- HuggingFace Transformers
- DVC (Data Version Control): Manages data somewhere else (e.g. cloud storage) while keeping the version and remote information in a metadata file in the Git repository.
- DVCLive: Provides callbacks for ML frameworks (e.g. TensorFlow/Keras) to record metrics during training in TSV format.
- DVC Studio: Visualizes the metrics from files in the Git repository. What to visualize is recorded in `dvc.yaml`.
- Google Drive: Used as a remote data repository. However, you can use others such as AWS S3, Google Cloud Storage, or your own file server.
- Jarvislabs.ai: Used to provision cloud GPU VM instances to run each experiment.