Multi node Multi card MPIJob #1702

Open: sramakintel wants to merge 10 commits into base: main (changes from 7 commits)
3 changes: 1 addition & 2 deletions examples/kubernetes/Chart.yaml
@@ -3,10 +3,9 @@ name: optimum-habana-example-chart
description: This Helm chart deploys example jobs using Optimum for Intel® Gaudi® Accelerators to a Kubernetes cluster.

# Compatible Kubernetes versions
-kubeVersion: 1.27-1.29
+kubeVersion: v1.28.7

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

1 change: 1 addition & 0 deletions examples/kubernetes/README.md
@@ -136,6 +136,7 @@ Validated use cases can be found in the `ci` directory:
| [`ci/multi-card-glue-values.yaml`](ci/multi-card-glue-values.yaml) | 2 | Uses 2 HPUs from a single node with the [`gaudi_spawn.py`](../gaudi_spawn.py) script to [fine tune BERT large](../text-classification/README.md#multi-card-training) (with whole word masking) on the text classification MRPC task using `run_glue.py`.
| [`ci/single-card-lora-clm-values.yaml`](ci/single-card-lora-clm-values.yaml) | 1 | Uses a single card to [fine tune Llama1-7B](../language-modeling/README.md#peft) with LoRA using the `run_lora_clm.py` script.
| [`ci/multi-card-lora-clm-values.yaml`](ci/multi-card-lora-clm-values.yaml) | 8 | Uses 8 HPUs from a single node with the [`gaudi_spawn.py`](../gaudi_spawn.py) script to [fine tune Llama1-7B](../language-modeling/README.md#peft) with LoRA using the `run_lora_clm.py` script.
+| [`ci/multi-node-multi-card-lora-clm-values.yaml`](ci/multi-node-multi-card-lora-clm-values.yaml) | 2 | Uses 2 HPUs (1 HPU on each of 2 nodes) with an MPIJob to [fine tune Llama1-7B](../language-modeling/README.md#peft) with LoRA using the `run_lora_clm.py` script.

### Deploy job to the cluster

142 changes: 142 additions & 0 deletions examples/kubernetes/ci/multi-node-multi-card-lora-clm-values.yaml
@@ -0,0 +1,142 @@
# Default values for examples.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

image:
  # -- Determines when the kubelet will pull the image to the worker nodes. Choose from: `IfNotPresent`, `Always`, or `Never`. If updates to the image have been made, use `Always` to ensure the newest image is used.
  pullPolicy: Always
  cleanPodPolicy: Running
  # -- Repository and name of the docker image
  repository:
  # -- Tag of the docker image
  tag:

imagePullSecrets: []

# -- Pod [annotations](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/) to attach metadata to the job
podAnnotations: {}

# # -- Specify a pod security context to run as a non-root user
# podSecurityContext:
#   fsGroup: 1000

# securityContext:
# -- Run as privileged or unprivileged. Certain deployments may require running as privileged; check with your system admin.
privileged: false

# -- The default 64MB of shared memory for docker containers can be insufficient when using more than one HPU. Setting hostIPC: true allows reusing the host's shared memory space inside the container.
hostIPC: true

# -- Define a config map's data as container environment variables
envFrom: []

# -- Define environment variables to set in the container
env:
  - name: LOGLEVEL
    value: INFO

secret:
  # -- Hugging Face token encoded using base64.
  encodedToken:
  # -- If a token is provided, specify a mount path that will be used to set HF_TOKEN_PATH
  secretMountPath: /tmp/hf_token

storage:
  # -- Name of the storage class to use for the persistent volume claim. To list the available storage classes use: `kubectl get storageclass`.
  storageClassName: csi-wekafs-fs
  # -- [Access modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes) for the persistent volume.
  accessModes:
    - "ReadWriteMany"
  # -- Storage [resources](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#resources)
  resources:
    requests:
      storage: 30Gi
  # -- Location where the PVC will be mounted in the pods
  pvcMountPath: &pvcMountPath /tmp/pvc-mount
  # -- A data access pod will be deployed when set to true
  deployDataAccessPod: false

resources:
  limits:
    # -- Specify the CPU limit for the job
    cpu: 16
    # -- Specify the number of Gaudi card(s)
    habana.ai/gaudi: 2

[Review thread]

> Has this been tested and validated to run on < 8 cards on multiple nodes?

sramakintel (Author, Jan 24, 2025): @ltran5991 it has been tested on 2 nodes with one card each

> How bout 2 nodes with 2 cards each?

Contributor: @sramakintel, could you test with 2 node/2 cards and confirm the code works. Thanks.

[End review thread]

    # -- Specify the [memory limit](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory) for the job
    memory: 64Gi
    # -- Specify hugepages-2Mi limits for the job
    hugepages-2Mi: 4400Mi
  requests:
    # -- Specify the CPU request for the job
    cpu: 16
    # -- Specify the number of Gaudi card(s)
    habana.ai/gaudi: 2
    # -- Specify [memory resource](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory) requests for the job
    memory: 64Gi
    # -- Specify hugepages-2Mi requests for the job
    hugepages-2Mi: 4400Mi

# -- Number of Gaudi nodes to be used
numNodes: 2
# -- Number of Gaudi cards to be used per node
numCards: 1
# -- Number of slots per worker
slotsPerWorker: 1
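
# A worked example of the scheduling math, assuming the mpijob-helm.yaml
# template below: the launcher derives NUM_NODES from the hostfile and
# CARDS_PER_NODE from numCards, so the total rank count is
# N_CARDS = numNodes * numCards = 2 * 1 = 2. To run 2 nodes with 2 cards
# each, set numCards: 2 and request a matching `habana.ai/gaudi` count in
# the resources section above.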


# Define the command to run in the container
command:
# python command to supply mpi run commands:
- python
- /optimum-habana/examples/language-modeling/run_lora_clm.py
- --model_name_or_path
- huggyllama/llama-7b
- --dataset_name
- tatsu-lab/alpaca
- --bf16
- --output_dir
- *pvcMountPath
- --num_train_epochs
- "3"
- --per_device_train_batch_size
- "12"
- --evaluation_strategy
- "no"
- --save_strategy
- "no"
- --learning_rate
- "1e-4"
- --warmup_ratio
- "0.03"
- --lr_scheduler_type
- "constant"
- --max_grad_norm
- "0.3"
- --logging_steps
- "1"
- --do_train
- --do_eval
- --use_habana
- --use_lazy_mode
- --throughput_warmup_steps
- "3"
- --lora_rank
- "8"
- --lora_alpha=16
- --lora_dropout=0.05
- --lora_target_modules
- "q_proj"
- "v_proj"
- --dataset_concatenation
- --max_seq_length=512
- --low_cpu_mem_usage=True
- --validation_split_percentage=4
- --adam_epsilon=1e-08

# # -- Optionally specify a [node selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) with labels the determine which node your worker pod will land on
nodeSelector: {}

# # -- Optionally specify [tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) to allow the worker pod to land on a node with a taint.
tolerations: []

# # -- Optionally provide node [affinities](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity) to constrain which node your worker pod will be scheduled on
affinity: {}
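
For reference, a minimal sketch of deploying this example with Helm, assuming the chart in `examples/kubernetes`; the release name, namespace, registry, and tag below are placeholders:

```bash
# Install the chart with the multi-node values file (all names below are
# placeholders; substitute your own release name, namespace, registry, and tag).
helm install gaudi-multinode-example examples/kubernetes \
  --namespace <your-namespace> \
  --values examples/kubernetes/ci/multi-node-multi-card-lora-clm-values.yaml \
  --set image.repository=<your-registry>/optimum-habana-example \
  --set image.tag=<your-tag>
```
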
91 changes: 91 additions & 0 deletions examples/kubernetes/templates/mpijob-helm.yaml
@@ -0,0 +1,91 @@
{{- if and .Values.numNodes (gt (int .Values.numNodes) 1) }}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: {{ .Release.Name }}-mpijob
spec:
  slotsPerWorker: {{ .Values.slotsPerWorker }}
  runPolicy:
    cleanPodPolicy: {{ .Values.image.cleanPodPolicy }}
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          hostIPC: {{ .Values.hostIPC }}
          containers:
            - name: {{ .Release.Name }}-mpijob-container
              image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
              imagePullPolicy: {{ .Values.image.pullPolicy }}
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
                  echo $MASTER_ADDR;
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  CARDS_PER_NODE={{ .Values.numCards }};
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));

                  SETUP_CMD="git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git; \
                  pip install -r optimum-habana/examples/language-modeling/requirements.txt;";

                  eval $SETUP_CMD;

                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    -mca routed direct \
                    git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git;

                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    -mca routed direct \
                    pip install -r optimum-habana/examples/language-modeling/requirements.txt;

                  MODEL_PATH=/optimum-habana/examples/language-modeling;
                  cd $MODEL_PATH;
                  mpirun -np $N_CARDS --npernode $CARDS_PER_NODE \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:$CARDS_PER_NODE:node:PE=6 \
                    -rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    -x MASTER_ADDR=$MASTER_ADDR \
                    -mca btl_tcp_if_include eth0 \
                    -mca oob_tcp_if_include eth0 \
                    -mca plm_rsh_no_tree_spawn 1 \
                    {{ .Values.command | join " " }};
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
    Worker:
      replicas: {{ .Values.numNodes }}
      template:
        spec:
          hostIPC: {{ .Values.hostIPC }}
          containers:
            - name: {{ .Release.Name }}-mpijob-container
              image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
              imagePullPolicy: {{ .Values.image.pullPolicy }}
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
              resources:
                {{- toYaml .Values.resources | nindent 16 }}
{{- end }}
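
Once deployed, the job can be checked with standard kubectl commands. A hedged sketch, assuming the Kubeflow MPI Operator is installed in the cluster; the launcher pod name below is a placeholder:

```bash
# Confirm the MPIJob and its launcher/worker pods were created.
kubectl get mpijobs
kubectl get pods

# Stream training output from the launcher pod; replace the name below with
# the actual launcher pod reported by `kubectl get pods`.
kubectl logs -f <release-name>-mpijob-launcher
```
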