This repo serves as the central location for the code, Containerfiles, and YAML files needed to deploy InstructLab onto an OpenShift cluster with Red Hat OpenShift AI (RHOAI). The project combines several tools included with RHOAI to run InstructLab: Data Science Pipelines for application orchestration, KServe for model serving, and the Distributed Training Operator for running model training across multiple GPU-enabled nodes.
This project makes running the InstructLab large language model (LLM) fine-tuning process easy and flexible on OpenShift. However, before getting started, there are a few prerequisites and additional setup steps that need to be completed.
The following Operators must be installed on your OpenShift cluster:
Within the DataScienceCluster definition, the following component must be defined and set to Managed under Spec -> Components:
oc edit dsc
trainingoperator:
  managementState: Managed
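As a non-interactive alternative to `oc edit`, the same setting can be expressed as a JSON merge patch. The sketch below only builds the patch and prints a command line. The DataScienceCluster name `default-dsc` is an assumption, so substitute the name reported by `oc get dsc`.

```python
import json
import shlex


def training_operator_patch() -> str:
    """Return a JSON merge patch that sets trainingoperator to Managed."""
    patch = {
        "spec": {
            "components": {
                "trainingoperator": {"managementState": "Managed"}
            }
        }
    }
    return json.dumps(patch)


def patch_command(dsc_name: str = "default-dsc") -> str:
    """Build an `oc patch` command line.

    dsc_name is a placeholder (assumption); use your cluster's DSC name.
    """
    args = [
        "oc", "patch", "dsc", dsc_name,
        "--type", "merge",
        "-p", training_operator_patch(),
    ]
    # shlex.join quotes the JSON so the line can be pasted into a shell.
    return shlex.join(args)


if __name__ == "__main__":
    print(patch_command())
```

The script prints the command rather than running it, so the patch can be reviewed before it is applied to the cluster.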
To verify that the PyTorchJob resource is available in the cluster, run the following:
oc get pytorchjob
If the command returns an error, the PyTorchJob resource is not available in the cluster.
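The same check can be scripted, for example as a pre-flight step before importing the pipeline. This is a minimal sketch that assumes the `oc` CLI is on the `PATH` and already logged in to the cluster; the function name `resource_available` is hypothetical.

```python
import subprocess
from typing import Callable


def resource_available(resource: str, run: Callable = subprocess.run) -> bool:
    """Return True if `oc get <resource>` exits with status 0.

    `run` is injectable so the check can be exercised without a cluster.
    """
    try:
        result = run(["oc", "get", resource], capture_output=True, text=True)
    except FileNotFoundError:
        # The `oc` binary itself is missing, so nothing can be verified.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if resource_available("pytorchjob"):
        print("PyTorchJob is available")
    else:
        print("PyTorchJob is not available; check the trainingoperator component")
```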
Once the above operators have been successfully installed, you will need to set up object storage for your models and pipeline artifacts. This solution requires S3-compatible object storage, such as NooBaa.
- If using NooBaa, apply the following tuning parameters.
- Create an Object Bucket Claim in your namespace. This will serve as the artifact store for your Data Science Pipeline.
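For reference, an Object Bucket Claim is a small manifest. The sketch below generates one as JSON, which `oc apply -f -` accepts. The API version `objectbucket.io/v1alpha1` and the storage class `openshift-storage.noobaa.io` are assumptions based on a typical OpenShift Data Foundation install, and the claim name and namespace are placeholders; verify all of them against your cluster.

```python
import json


def object_bucket_claim(
    name: str,
    namespace: str,
    storage_class: str = "openshift-storage.noobaa.io",  # assumption
) -> dict:
    """Build an ObjectBucketClaim manifest as a plain dict."""
    return {
        "apiVersion": "objectbucket.io/v1alpha1",  # assumption, check your CRD
        "kind": "ObjectBucketClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Prefix for the generated bucket name.
            "generateBucketName": name,
            "storageClassName": storage_class,
        },
    }


if __name__ == "__main__":
    # Placeholder names; pipe the output to `oc apply -f -`.
    print(json.dumps(object_bucket_claim("ilab-artifacts", "my-project"), indent=2))
```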
From within the RHOAI dashboard, navigate to the "Data Science Pipelines" page and click "Configure pipeline server". This will present you with a form where you can upload the credentials for the S3 bucket you created in the previous step.
An accelerator profile must also be defined within the RHOAI dashboard or via the CLI to enable GPU acceleration for model serving with KServe.
apiVersion: v1
items:
  - apiVersion: dashboard.opendatahub.io/v1
    kind: AcceleratorProfile
    metadata:
      name: gpu
      namespace: redhat-ods-applications
    spec:
      displayName: gpu
      enabled: true
      identifier: nvidia.com/gpu
      tolerations: []
A signed certificate helps avoid unnecessary issues when running the training pipeline.
To deploy a signed certificate in your cluster, follow the trusted cluster cert documentation.
In addition to model training, InstructLab also performs Synthetic Data Generation (SDG) and Model Evaluation. Both steps require another LLM. Since these models do not change frequently, we recommend serving them independently of any specific InstructLab pipeline. This allows the models to be used as a shared resource across the organization.
- Deploy the Teacher Model following these instructions.
- Deploy the Judge Model following these instructions.
Once these two model servers are deployed, we need to add the following ConfigMaps and Secrets to our namespace so that the InstructLab pipeline can successfully communicate with each model.
kind: ConfigMap
apiVersion: v1
metadata:
  name: teacher-server
data:
  endpoint: '<YOUR_MIXTRAL_MODEL_ENDPOINT>'
  model: mixtral
  ca.crt: | # If using TLS
    -----BEGIN CERTIFICATE-----
    <TLS Certificate to Teacher Model>
    -----END CERTIFICATE-----
kind: Secret
apiVersion: v1
metadata:
  name: teacher-server
data:
  api_key: <YOUR_MIXTRAL_API_KEY> # values under data must be base64-encoded
type: Opaque
kind: ConfigMap
apiVersion: v1
metadata:
  name: judge-server
data:
  endpoint: '<YOUR_PROMETHEUS_MODEL_ENDPOINT>'
  model: prometheus
  ca.crt: | # If using TLS
    -----BEGIN CERTIFICATE-----
    <TLS Certificate to Judge Model>
    -----END CERTIFICATE-----
kind: Secret
apiVersion: v1
metadata:
  name: judge-server
data:
  api_key: <YOUR_PROMETHEUS_API_KEY> # values under data must be base64-encoded
type: Opaque
NOTE: You can find and copy the certs needed for the teacher- and judge-server ConfigMaps in another ConfigMap, kube-root-ca.crt, found in the same namespace as the hosted model.
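Because values under a Secret's `data` field must be base64-encoded, it can be convenient to generate the ConfigMap and Secret pair programmatically. The following is a minimal sketch, not part of this repo: the function name `model_server_manifests` and the example endpoint and key are hypothetical, while the resource names and keys (`endpoint`, `model`, `ca.crt`, `api_key`) follow the manifests above.

```python
import base64
import json
from typing import Optional, Tuple


def model_server_manifests(
    name: str,
    endpoint: str,
    model: str,
    api_key: str,
    ca_cert: Optional[str] = None,
) -> Tuple[dict, dict]:
    """Build the ConfigMap and Secret for one model server."""
    config_data = {"endpoint": endpoint, "model": model}
    if ca_cert is not None:
        # Only needed when the model endpoint uses TLS.
        config_data["ca.crt"] = ca_cert
    config_map = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": config_data,
    }
    secret = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        # Secret.data expects base64-encoded strings.
        "data": {
            "api_key": base64.b64encode(api_key.encode("utf-8")).decode("ascii")
        },
    }
    return config_map, secret


if __name__ == "__main__":
    # Placeholder endpoint and key (assumptions); replace with real values.
    cm, sec = model_server_manifests(
        "teacher-server", "https://teacher.example.com/v1", "mixtral", "changeme"
    )
    # A v1 List lets both objects be applied at once with `oc apply -f -`.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [cm, sec]}, indent=2))
```

Calling the function a second time with `judge-server` and `prometheus` produces the matching pair for the judge model.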
Now that all the cluster requirements have been set up, we are ready to upload and run our InstructLab pipeline!
Next, we can go back to our RHOAI Data Science Pipelines dashboard and select "Import pipeline". We recommend importing the pipeline YAML directly from the GitHub repo using: https://raw.githubusercontent.com/opendatahub-io/ilab-on-ocp/refs/heads/main/pipeline.yaml
Once the pipeline is uploaded we will be able to select "Create run" from the "Actions" dropdown. This will present us with a number of parameters we can set to customize our run. Click "Create run" at the bottom of the page to kick off your InstructLab pipeline.
Parameter | Definition |
---|---|
sdg_repo_url | SDG parameter. Points to a taxonomy git repository. |
sdg_repo_branch | SDG parameter. Points to a branch within the taxonomy git repository. If set, has priority over sdg_repo_pr. |
sdg_repo_pr | SDG parameter. Points to a pull request against the taxonomy git repository. |
sdg_base_model | SDG parameter. LLM model used to generate the synthetic dataset. |
sdg_scale_factor | SDG parameter. The total number of instructions to be generated. |
sdg_pipeline | SDG parameter. Data generation pipeline to use. Available: 'simple', 'full', or a valid path to a directory of pipeline workflow YAML files. Note that 'full' requires a larger teacher model, Mixtral-8x7b. |
sdg_max_batch_len | SDG parameter. Maximum tokens per GPU for each batch that will be handled in a single step. |
train_nproc_per_node | Training parameter. Number of GPUs per node/worker to use for training. |
train_nnodes | Training parameter. Number of nodes/workers to train on. |
train_num_epochs_phase_1 | Training parameter for Phase 1. Number of epochs to run training. |
train_num_epochs_phase_2 | Training parameter for Phase 2. Number of epochs to run training. |
train_effective_batch_size_phase_1 | Training parameter for Phase 1. The number of samples in a batch that the model should see before its parameters are updated. |
train_effective_batch_size_phase_2 | Training parameter for Phase 2. The number of samples in a batch that the model should see before its parameters are updated. |
train_learning_rate_phase_1 | Training parameter for Phase 1. How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size. |
train_learning_rate_phase_2 | Training parameter for Phase 2. How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size. |
train_num_warmup_steps_phase_1 | Training parameter for Phase 1. The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to train_learning_rate. |
train_num_warmup_steps_phase_2 | Training parameter for Phase 2. The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to train_learning_rate. |
train_save_samples | Training parameter. Number of samples the model should see before saving a checkpoint. |
train_max_batch_len | Training parameter. Maximum tokens per GPU for each batch that will be handled in a single step. |
train_seed | Training parameter. Random seed for initializing training. |
mt_bench_max_workers | MT Bench parameter. Number of workers to use for evaluation with mt_bench or mt_bench_branch. Must be a positive integer or 'auto'. |
mt_bench_merge_system_user_message | MT Bench parameter. Boolean indicating whether to merge system and user messages (required for Mistral-based judges). |
final_eval_max_workers | Final model evaluation parameter for MT Bench Branch. Number of workers to use for evaluation with mt_bench or mt_bench_branch. Must be a positive integer or 'auto'. |
final_eval_few_shots | Final model evaluation parameter for MMLU. Number of question-answer pairs provided in the context preceding the question used for evaluation. |
final_eval_batch_size | Final model evaluation parameter for MMLU. Batch size for evaluation. Valid values are a positive integer or 'auto' to select the largest batch size that will fit in memory. |
final_eval_merge_system_user_message | Final model evaluation parameter for MT Bench Branch. Boolean indicating whether to merge system and user messages (required for Mistral-based judges). |
k8s_storage_class_name | A Kubernetes StorageClass name for persistent volumes. The selected StorageClass must support RWX PersistentVolumes. |
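A few of these parameters follow simple rules that are easy to check before creating a run. The sketch below is illustrative only and is not how the pipeline itself validates input: the helper names are hypothetical, the GPU count assumes every worker uses the same number of GPUs, and the warmup function models only the linear ramp described above, holding the rate constant afterwards as a simplification.

```python
from typing import Union


def check_positive_int_or_auto(value: Union[int, str]) -> Union[int, str]:
    """Validate values such as mt_bench_max_workers or final_eval_batch_size."""
    if value == "auto":
        return value
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"expected a positive integer or 'auto', got {value!r}")
    return value


def total_gpus(train_nproc_per_node: int, train_nnodes: int) -> int:
    """GPUs requested by a run: GPUs per worker times number of workers."""
    return train_nproc_per_node * train_nnodes


def warmup_learning_rate(step: int, warmup_steps: int, learning_rate: float) -> float:
    """Linear warmup from 0 to learning_rate over warmup_steps.

    After warmup the rate is held constant here (a simplification).
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return learning_rate
    return learning_rate * step / warmup_steps


if __name__ == "__main__":
    print(check_positive_int_or_auto("auto"))
    print(total_gpus(train_nproc_per_node=2, train_nnodes=4))
    print(warmup_learning_rate(step=50, warmup_steps=100, learning_rate=2e-5))
```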
The pipeline.yaml provided in this repo will always represent the most up-to-date version of the pipeline, as our team continues to improve it and keep it in line with the InstructLab CLI. However, if you are a contributor or simply want to experiment with custom changes to the pipeline, you can do so by editing and "compiling" the pipeline.py file provided in this repo.
The pipeline YAML is defined by the pipeline.py file and then converted, via the Kubeflow Pipelines Python SDK, into the intermediate representation YAML that Data Science Pipelines expects. If you want to customize the pipeline in any way, update pipeline.py, run the make command below, and then upload the pipeline to your Data Science Pipelines instance as shown above.
make pipeline
See standalone for instructions on deploying the InstructLab solution without the need for RHOAI.
To generate the standalone.py script, run the following command (the ruff tool must be installed):
make standalone
To collaborate on this repository, please follow these steps:
- Install uv
- Run the following commands to prepare your local environment
uv sync
source .venv/bin/activate
When updating Python package dependencies in pyproject.toml, regenerate requirements.txt:
uv pip compile pyproject.toml > requirements.txt