add user guide for parallelstore backup and recovery on GKE training workload (#951)

* add user guide for parallelstore backup and recovery on GKE training workload
* fix readme
1 parent b69bc69, commit 799fd5a
Showing 3 changed files with 244 additions and 0 deletions.
139 changes: 139 additions & 0 deletions
tutorials-and-examples/storage/parallelstore-backup-and-recovery/README.md
@@ -0,0 +1,139 @@
# Data backup and recovery for Parallelstore

## Data Backup

### Prerequisites

Follow the instructions in [Create and connect to a Parallelstore instance from Google Kubernetes Engine](https://cloud.google.com/parallelstore/docs/connect-from-kubernetes-engine) to create a GKE cluster with Parallelstore enabled.

### [WIP] (Optional) Deploy a training workload that uses Parallelstore for training data and saving checkpoints

Work is in progress to publish new [gpu-recipes](https://github.com/AI-Hypercomputer/gpu-recipes/tree/main) that use Parallelstore as storage for training data and checkpoints. The plan is to merge them into the [current recipe for Llama 70b Nemo pretraining on GKE](https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/training/a3mega/llama-3.1-70b/nemo-pretraining-gke) so that they become publicly available.

* [WIP] https://source.corp.google.com/h/aaie-internal-sandbox/experimental/+/main:davidsotomora/example-workload/training/a3mega/llama-3.1-70b/nemo-pretraining-gke/README.md

### Connect to your GKE cluster

```
gcloud container clusters get-credentials $CLUSTER_NAME --zone $CLUSTER_ZONE --project $PROJECT_ID
```

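To confirm that the credentials landed, one option is to compare the active kubectl context with the name `gcloud` generates for a zonal cluster (`gke_<project>_<zone>_<cluster>`). This is a minimal sketch; the helper name `expected_gke_context` is made up for illustration and is not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the kubeconfig context name that
# `gcloud container clusters get-credentials` creates for a zonal cluster.
expected_gke_context() {
  local project="$1" zone="$2" cluster="$3"
  printf 'gke_%s_%s_%s\n' "$project" "$zone" "$cluster"
}

# Usage (needs kubectl and the variables from the command above):
#   [ "$(kubectl config current-context)" = \
#     "$(expected_gke_context "$PROJECT_ID" "$CLUSTER_ZONE" "$CLUSTER_NAME")" ] \
#     && echo "kubectl points at the expected cluster"
```
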
### Provision required permissions

The GKE CronJob needs the **roles/parallelstore.admin** and **roles/storage.admin** roles to import and export data between GCS and Parallelstore.

#### Create a GCP IAM service account

```
gcloud iam service-accounts create pstore-sa \
  --project=$PROJECT_ID
```

#### Grant the GCP service account the Parallelstore admin and GCS admin roles

```
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member "serviceAccount:pstore-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/parallelstore.admin"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member "serviceAccount:pstore-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/storage.admin"
```

#### Create a GKE service account and allow it to impersonate the GCP service account

```
kubectl apply -f ./pstore-sa.yaml
```

##### Bind the GCP SA and GKE SA

```
gcloud iam service-accounts add-iam-policy-binding pstore-sa@$PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$PROJECT_ID.svc.id.goog[default/pstore-sa]"
```

##### Annotate the GKE SA with the GCP SA

```
kubectl annotate serviceaccount pstore-sa \
  --namespace default \
  iam.gke.io/gcp-service-account=pstore-sa@$PROJECT_ID.iam.gserviceaccount.com
```

#### Grant permission to the Parallelstore agent service account

* GCS_BUCKET: ***The GCS bucket URI in the format "gs://<bucket_name>"***

```
gcloud storage buckets add-iam-policy-binding $GCS_BUCKET \
  --member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-parallelstore.iam.gserviceaccount.com \
  --role=roles/storage.admin
```

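The binding above assumes `$PROJECT_NUMBER` is already set, which none of the earlier steps do. The sketch below shows one way to derive it and to sanity-check `GCS_BUCKET` first. The helper names `parallelstore_agent_member` and `require_gs_uri` are illustrative assumptions, not part of gcloud.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the IAM member string for the Parallelstore
# service agent from a project number, following the pattern used above.
parallelstore_agent_member() {
  local project_number="$1"
  printf 'serviceAccount:service-%s@gcp-sa-parallelstore.iam.gserviceaccount.com\n' "$project_number"
}

# Hypothetical helper: succeeds only if the argument looks like gs://<bucket_name>.
require_gs_uri() {
  case "$1" in
    gs://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (requires gcloud and an authenticated session):
#   PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)")
#   require_gs_uri "$GCS_BUCKET" || echo "GCS_BUCKET must start with gs://" >&2
#   parallelstore_agent_member "$PROJECT_NUMBER"
```
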
### CronJob for periodically exporting data from Parallelstore to GCS

Update the variables below based on your workload setup, then deploy the CronJob to your cluster.

* PSTORE_MOUNT_PATH: `e.g. "/data-ps"` ***The mount path of the Parallelstore instance; it should match the volumeMount defined for this container***

* PSTORE_PV_NAME: `e.g. "store-pv"` ***The name of the GKE Persistent Volume that points to your Parallelstore instance. This should have been set up in your cluster as part of the prerequisites***

* PSTORE_PVC_NAME: `e.g. "pstore-pvc"` ***The name of the GKE Persistent Volume Claim that requests the Parallelstore Persistent Volume. This should have been set up in your cluster as part of the prerequisites***

* PSTORE_NAME: `e.g. "checkpoints-ps"` ***The name of the Parallelstore instance to back up***

* PSTORE_LOCATION: `e.g. "us-central1-a"` ***The location/zone of the Parallelstore instance to back up***

* SOURCE_PARALLELSTORE_PATH: `e.g. "/nemo-experiments/user-model-workload-ps-64/checkpoints/"` ***The absolute path within the Parallelstore instance, WITHOUT the volume mount path; it must start with "/"***

* DESTINATION_GCS_URI: `e.g. "gs://checkpoints-gcs/checkpoints/"` ***The URI of a Cloud Storage bucket, or of a path within a bucket, in the format "gs://<bucket_name>/<optional_path_inside_bucket>"***

* DELETE_AFTER_BACKUP: `e.g. false` ***Whether to delete old data from Parallelstore after the backup to free up space***

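A few of these values have format rules that are easy to get wrong, and the backup script runs the value of `DELETE_AFTER_BACKUP` as a shell command, so it has to be exactly `true` or `false`. The sketch below is a pre-flight check under those assumptions; `validate_backup_vars` is a hypothetical helper, not something the CronJob ships with.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the CronJob variables described above.
# Prints one line per problem and returns non-zero if any check fails.
validate_backup_vars() {
  local src="$1" dest="$2" delete_flag="$3" rc=0
  case "$src" in
    /*) ;;
    *) echo "SOURCE_PARALLELSTORE_PATH must start with /"; rc=1 ;;
  esac
  case "$dest" in
    gs://?*) ;;
    *) echo "DESTINATION_GCS_URI must start with gs://"; rc=1 ;;
  esac
  case "$delete_flag" in
    true|false) ;;
    *) echo "DELETE_AFTER_BACKUP must be true or false"; rc=1 ;;
  esac
  return "$rc"
}

# Example with the sample values from the list above.
validate_backup_vars "/nemo-experiments/user-model-workload-ps-64/checkpoints/" \
  "gs://checkpoints-gcs/checkpoints/" "false" && echo "variables look valid"
```
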
```
kubectl apply -f ./ps-to-gcs-backup.yaml
```

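When `DELETE_AFTER_BACKUP` is true, the job removes folders that are older than the most recently modified one. The dry-run sketch below illustrates that rule on a temporary directory without deleting anything. It is a simplification: it looks only at direct children and compares whole seconds, whereas the job walks nested folders and truncates the cutoff to the minute. It assumes GNU `find`, `stat` and `touch`, and `list_stale_dirs` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the cleanup rule: list every direct subfolder whose
# modification time is older than the newest subfolder's modification time.
list_stale_dirs() {
  local root="$1" newest dir
  # Newest modification time among direct subfolders, as epoch seconds.
  newest=$(find "$root" -mindepth 1 -maxdepth 1 -type d -printf '%T@\n' | sort -n | tail -1)
  newest=${newest%.*}   # drop the fractional part
  find "$root" -mindepth 1 -maxdepth 1 -type d | sort | while read -r dir; do
    if [ "$(stat -c %Y "$dir")" -lt "$newest" ]; then
      echo "$dir"
    fi
  done
}

# Self-contained demo on a temporary directory.
demo=$(mktemp -d)
mkdir "$demo/step-100" "$demo/step-200" "$demo/step-300"
touch -d '2025-01-01 10:00' "$demo/step-100"
touch -d '2025-01-01 11:00' "$demo/step-200"
touch -d '2025-01-01 12:00' "$demo/step-300"
list_stale_dirs "$demo"   # lists step-100 and step-200, keeps step-300
rm -rf "$demo"
```

Running a listing like this against the mounted path before enabling deletion is one way to see which folders a backup run would remove.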
## Data Recovery

If a disaster occurs or the Parallelstore instance fails for any reason, you can either use the GKE Volume Populator to automatically preload data from GCS into a fully managed Parallelstore instance, or manually create a new Parallelstore instance and import data from the GCS backup.

### GKE Volume Populator

Detailed instructions for using the GKE Volume Populator to preload data into a new Parallelstore instance are in [Transfer data from Cloud Storage during dynamic provisioning using GKE Volume Populator](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/volume-populator#preload-parallelstore).

### Manual recovery

* PARALLELSTORE_NAME: ***The name of this Parallelstore instance***

* CAPACITY_GB: ***Storage capacity of the instance in GB, a value from 12000 to 100000, in multiples of 4000***

* PARALLELSTORE_LOCATION: ***Must be one of the supported locations***

* NETWORK_NAME: ***The name of the VPC network that you created in Configure a VPC network; it must be the same network your GKE cluster uses and must have private services access enabled***

* SOURCE_GCS_PATH: ***The URI of a Cloud Storage bucket, or of a path within a bucket, in the format "gs://<bucket_name>/<optional_path_inside_bucket>"***

* DESTINATION_PARALLELSTORE_URI: ***The absolute path within the Parallelstore instance, WITHOUT the volume mount path; it must start with "/"***

#### Create a new Parallelstore instance
```
gcloud beta parallelstore instances create $PARALLELSTORE_NAME \
  --capacity-gib=$CAPACITY_GB \
  --location=$PARALLELSTORE_LOCATION \
  --network=$NETWORK_NAME \
  --project=$PROJECT_ID
```

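Checking the capacity locally before calling the create command can save a failed request. The sketch below encodes the rule stated for `CAPACITY_GB` (12000 to 100000, in multiples of 4000); `valid_capacity` is a hypothetical helper and the limits are taken from this guide, not queried from the service.

```shell
#!/usr/bin/env bash
# Hypothetical check for the capacity rule stated in this guide:
# 12000 <= capacity <= 100000 and a multiple of 4000.
valid_capacity() {
  local cap="$1"
  case "$cap" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$cap" -ge 12000 ] && [ "$cap" -le 100000 ] && [ $((cap % 4000)) -eq 0 ]
}

valid_capacity 12000 && echo "12000 ok"
valid_capacity 13000 || echo "13000 rejected"
```
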
#### Import data from GCS
```
uuid=$(cat /proc/sys/kernel/random/uuid) # generate a uuid for the parallelstore data import request-id
gcloud beta parallelstore instances import-data $PARALLELSTORE_NAME \
  --location=$PARALLELSTORE_LOCATION \
  --source-gcs-bucket-uri=$SOURCE_GCS_PATH \
  --destination-parallelstore-path=$DESTINATION_PARALLELSTORE_URI \
  --request-id=$uuid \
  --async
```
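Because the import runs with `--async`, the command returns before the data has been copied. The sketch below is a wait loop modeled on the polling loop in `ps-to-gcs-backup.yaml`. `DESCRIBE_CMD` and `POLL_SECONDS` are hypothetical knobs added here so the loop can be tried without gcloud; they are not gcloud settings.

```shell
#!/usr/bin/env bash
# Default probe: asks gcloud whether the operation has finished.
# The backup CronJob relies on this printing "True" once the operation is done.
describe_done() {
  gcloud beta parallelstore operations describe "$1" \
    --location="$2" --format="value(done)"
}

# Polls the operation until the probe prints "True".
# DESCRIBE_CMD (assumption) swaps in a different probe function;
# POLL_SECONDS (assumption) sets the delay between polls, default 60.
wait_for_operation() {
  local operation="$1" location="$2"
  local describe="${DESCRIBE_CMD:-describe_done}"
  while [ "$("$describe" "$operation" "$location")" != "True" ]; do
    sleep "${POLL_SECONDS:-60}"
  done
  echo "operation $operation finished"
}

# Usage, after capturing the operation name by adding
# --format="value(name)" to the import command as the CronJob does:
#   wait_for_operation "$operation" "$PARALLELSTORE_LOCATION"
```

Once the loop returns, the CronJob additionally reads the operation's `error` field to tell success from failure; the same check applies to an import.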
98 changes: 98 additions & 0 deletions
tutorials-and-examples/storage/parallelstore-backup-and-recovery/ps-to-gcs-backup.yaml
@@ -0,0 +1,98 @@
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ps-to-gcs-backup
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  schedule: "0 * * * *"
  successfulJobsHistoryLimit: 3
  suspend: false
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            gke-parallelstore/cpu-limit: "0"
            gke-parallelstore/ephemeral-storage-limit: "0"
            gke-parallelstore/memory-limit: "0"
            gke-parallelstore/volumes: "true"
        spec:
          serviceAccountName: pstore-sa
          containers:
          - name: pstore-backup
            image: google/cloud-sdk:slim
            imagePullPolicy: IfNotPresent
            command:
            - /bin/bash
            - -c
            - |
              #!/bin/bash
              set -ex
              # Retrieve the modification timestamp of the most recently modified folder, truncated to the minute
              latest_folder_timestamp=$(find "$PSTORE_MOUNT_PATH/$SOURCE_PARALLELSTORE_PATH" -type d -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2- | xargs -I{} stat -c %y {} | xargs -I {} date -d {} +"%Y-%m-%d %H:%M")
              # Start exporting from PStore to GCS
              operation=$(gcloud beta parallelstore instances export-data $PSTORE_NAME \
                --location=$PSTORE_LOCATION \
                --source-parallelstore-path=$SOURCE_PARALLELSTORE_PATH \
                --destination-gcs-bucket-uri=$DESTINATION_GCS_URI \
                --async \
                --format="value(name)")
              # Wait until the operation completes
              while true; do
                status=$(gcloud beta parallelstore operations describe $operation \
                  --location=$PSTORE_LOCATION \
                  --format="value(done)")
                if [ "$status" == "True" ]; then
                  break
                fi
                sleep 60
              done
              # Check whether the export succeeded
              error=$(gcloud beta parallelstore operations describe $operation \
                --location=$PSTORE_LOCATION \
                --format="value(error)")
              if [ "$error" != "" ]; then
                echo "!!! ERROR while exporting data !!!"
              fi
              # Delete the old files from PStore if requested
              # This will NOT delete the folder with the latest modification timestamp
              if $DELETE_AFTER_BACKUP && [ "$error" == "" ]; then
                find "$PSTORE_MOUNT_PATH/$SOURCE_PARALLELSTORE_PATH" -mindepth 1 -type d |
                  while read -r dir; do
                    # Only delete folders that were modified earlier than the latest modification timestamp
                    folder_timestamp=$(stat -c %y "$dir")
                    if [ $(date -d "$folder_timestamp" +%s) -lt $(date -d "$latest_folder_timestamp" +%s) ]; then
                      echo "Deleting $dir"
                      rm -rf "$dir"
                    fi
                  done
              fi
            env:
            - name: PSTORE_MOUNT_PATH  # mount path of the Parallelstore instance, should match the volumeMount defined for this container
              value: "/datacached"
            - name: PSTORE_NAME  # name of the Parallelstore instance to back up
              value: "chdu-checkpoints-ps"
            - name: PSTORE_LOCATION  # location/zone of the Parallelstore instance to back up
              value: "us-central1-a"
            - name: SOURCE_PARALLELSTORE_PATH  # absolute path within the PStore instance, WITHOUT the volume mount path
              value: "/nemo-experiments/user-model-workload-ps-64-2025-01-23-18-22-31/checkpoints/"
            - name: DESTINATION_GCS_URI  # GCS bucket URI used for storing backups, starting with "gs://"
              value: "gs://chdu-checkpoints-gcs/checkpoints/"
            - name: DELETE_AFTER_BACKUP  # deletes old data from Parallelstore if true
              value: "true"
            volumeMounts:
            - mountPath: /datacached  # should match the value of env var PSTORE_MOUNT_PATH
              name: pstore-cached
          dnsPolicy: ClusterFirst
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 30
          volumes:
          - name: pstore-cached
            persistentVolumeClaim:
              claimName: parallelstore-pvc-cached
7 changes: 7 additions & 0 deletions
tutorials-and-examples/storage/parallelstore-backup-and-recovery/pstore-sa.yaml
@@ -0,0 +1,7 @@
# Service account that has access to Parallelstore and GCS
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pstore-sa
  namespace: default