Commit 799fd5a

add user guide for parallelstore backup and recovery on GKE training workload (#951)

* add user guide for parallelstore backup and recovery on GKE training workload
* fix readme

chengcongdu authored Feb 26, 2025 · 1 parent b69bc69
Showing 3 changed files with 244 additions and 0 deletions.

# Data backup and recovery for Parallelstore

## Data Backup

### Prerequisites

Follow the instructions in [Create and connect to a Parallelstore instance from Google Kubernetes Engine](https://cloud.google.com/parallelstore/docs/connect-from-kubernetes-engine) to create a GKE cluster with Parallelstore enabled.

### [WIP] (Optional) Deploy a training workload that uses Parallelstore for training data and saving checkpoints

There is an ongoing effort to publish new [gpu-recipes](https://github.com/AI-Hypercomputer/gpu-recipes/tree/main) that use Parallelstore as storage for training data and checkpoints. The plan is to merge it into the [current recipe for Llama 3.1 70B NeMo pretraining on GKE](https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/training/a3mega/llama-3.1-70b/nemo-pretraining-gke) so that it becomes publicly available.

* [WIP]https://source.corp.google.com/h/aaie-internal-sandbox/experimental/+/main:davidsotomora/example-workload/training/a3mega/llama-3.1-70b/nemo-pretraining-gke/README.md

### Connect to your GKE cluster

```
gcloud container clusters get-credentials $CLUSTER_NAME --zone $CLUSTER_ZONE --project $PROJECT_ID
```
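
The commands throughout this guide assume several shell variables are already set. A minimal sketch with placeholder values (every value below is an assumption; substitute your own):

```
# Placeholder values for the variables referenced throughout this guide.
# Replace each one with the values from your own project and cluster.
export PROJECT_ID="my-project"                 # your GCP project ID
export PROJECT_NUMBER="123456789"              # your GCP project number
export CLUSTER_NAME="pstore-cluster"           # the GKE cluster with Parallelstore enabled
export CLUSTER_ZONE="us-central1-a"            # the zone of that cluster
export GCS_BUCKET="gs://my-backup-bucket"      # the Cloud Storage bucket used for backups
```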

### Provision required permissions

Your GKE CronJob needs the **roles/parallelstore.admin** and **roles/storage.admin** roles to import and export data between GCS and Parallelstore.

#### Create a GCP IAM service account

```
gcloud iam service-accounts create pstore-sa \
--project=$PROJECT_ID
```

#### Grant the GCP service account the Parallelstore admin and GCS admin roles

```
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member "serviceAccount:pstore-sa@$PROJECT_ID.iam.gserviceaccount.com" \
--role "roles/parallelstore.admin"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member "serviceAccount:pstore-sa@$PROJECT_ID.iam.gserviceaccount.com" \
--role "roles/storage.admin"
```

#### Create GKE Service Account and allow it to impersonate GCP Service Account

```
kubectl apply -f ./pstore-sa.yaml
```

##### Bind the GCP SA and GKE SA

```
gcloud iam service-accounts add-iam-policy-binding pstore-sa@$PROJECT_ID.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:$PROJECT_ID.svc.id.goog[default/pstore-sa]"
```

##### Annotate the GKE SA with GCP SA

```
kubectl annotate serviceaccount pstore-sa \
  --namespace default \
  iam.gke.io/gcp-service-account=pstore-sa@$PROJECT_ID.iam.gserviceaccount.com
```

#### Grant permissions to the Parallelstore agent service account

* GCS_BUCKET: ***The GCS bucket URI in the format "gs://<bucket_name>"***

```
gcloud storage buckets add-iam-policy-binding $GCS_BUCKET \
--member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-parallelstore.iam.gserviceaccount.com \
--role=roles/storage.admin
```
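
To confirm the binding took effect, you can read back the bucket's IAM policy and look for the Parallelstore agent service account. A hedged sketch (the helper name `check_agent_binding` is ours, not part of the guide):

```
# Hypothetical helper: print the bucket's IAM policy and check that the
# Parallelstore agent service account appears among the bindings.
check_agent_binding() {
  gcloud storage buckets get-iam-policy "$GCS_BUCKET" --format=json |
    grep -q "gcp-sa-parallelstore.iam.gserviceaccount.com"
}
```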

### CronJob for periodically exporting data from Parallelstore to GCS

Update the variables below based on your workload setup, then deploy the CronJob to your cluster.

* PSTORE_MOUNT_PATH: `e.g. "/data-ps"` ***The mount path of the Parallelstore Instance, should match the volumeMount defined for this container***

* PSTORE_PV_NAME: `e.g. "store-pv"` ***The name of the GKE Persistent Volume that points to your Parallelstore Instance. This should have been set up in your cluster as part of the prerequisites***

* PSTORE_PVC_NAME: `e.g. "pstore-pvc"` ***The name of the GKE Persistent Volume Claim that requests the usage of the Parallelstore Persistent Volume. This should have been set up in your cluster as part of the prerequisites***

* PSTORE_NAME: `e.g. "checkpoints-ps"` ***The name of the Parallelstore instance to back up***

* PSTORE_LOCATION: `e.g. "us-central1-a"` ***The location/zone of the Parallelstore instance to back up***

* SOURCE_PARALLELSTORE_PATH: `e.g. "/nemo-experiments/user-model-workload-ps-64/checkpoints/"` ***The absolute path within the Parallelstore instance, WITHOUT the volume mount path; must start with "/"***

* DESTINATION_GCS_URI: `e.g. "gs://checkpoints-gcs/checkpoints/"` ***The GCS bucket path URI to a Cloud Storage bucket, or a path within a bucket, using the format "gs://<bucket_name>/<optional_path_inside_bucket>"***

* DELETE_AFTER_BACKUP: `e.g. false` ***Whether to delete old data from Parallelstore after backup and free up space***
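
When DELETE_AFTER_BACKUP is enabled, the CronJob keeps only the most recently modified directory and removes everything older. A minimal local sketch of that pruning rule (the helper name `prune_old_dirs` is ours, and the logic is simplified to whole-second precision):

```
# Hypothetical helper illustrating the pruning rule: delete every
# subdirectory modified strictly earlier than the newest one.
prune_old_dirs() {
  local root="$1"
  # Epoch timestamp of the most recently modified subdirectory
  local latest
  latest=$(find "$root" -mindepth 1 -type d -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f1)
  find "$root" -mindepth 1 -type d | while read -r dir; do
    local ts
    ts=$(stat -c %Y "$dir")
    # Keep the directory with the latest modification time; delete the rest
    if [ "${ts%.*}" -lt "${latest%.*}" ]; then
      echo "Deleting $dir"
      rm -rf "$dir"
    fi
  done
}
```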

```
kubectl apply -f ./ps-to-gcs-backup.yaml
```
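
Once deployed, you may want to trigger a one-off run to validate the job before the first scheduled execution. A hedged sketch (the helper name `trigger_backup_test` and the job name `manual-backup-test` are arbitrary choices of ours):

```
# Hypothetical helper: run the backup CronJob once, on demand, and
# follow its logs to confirm the export completes.
trigger_backup_test() {
  kubectl create job --from=cronjob/ps-to-gcs-backup manual-backup-test
  kubectl logs -f job/manual-backup-test
}
```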


## Data Recovery

If a disaster occurs or the Parallelstore instance fails for any reason, you can either use the GKE Volume Populator to automatically preload data from GCS into a fully managed Parallelstore instance, or manually create a new Parallelstore instance and import the data from your GCS backup.

### GKE Volume Populator

Detailed instructions on how to use the GKE Volume Populator to preload data into a new Parallelstore instance can be found in [Transfer data from Cloud Storage during dynamic provisioning using GKE Volume Populator](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/volume-populator#preload-parallelstore).

### Manual recovery

* PARALLELSTORE_NAME ***The name of the new Parallelstore instance***

* CAPACITY_GB ***Storage capacity of the instance in GB; value from 12000 to 100000, in multiples of 4000***

* PARALLELSTORE_LOCATION ***Must be one of the Supported locations***

* NETWORK_NAME ***The name of the VPC network that you created in Configure a VPC network, must be the same network your GKE cluster uses and have private services access enabled***

* SOURCE_GCS_PATH: ***The GCS bucket path URI to a Cloud Storage bucket, or a path within a bucket, using the format "gs://<bucket_name>/<optional_path_inside_bucket>"***

* DESTINATION_PARALLELSTORE_URI: ***The absolute path within the Parallelstore instance, WITHOUT the volume mount path; must start with "/"***

#### Create a new Parallelstore Instance
```
gcloud beta parallelstore instances create $PARALLELSTORE_NAME \
--capacity-gib=$CAPACITY_GB \
--location=$PARALLELSTORE_LOCATION \
--network=$NETWORK_NAME \
--project=$PROJECT_ID
```

#### Import data from GCS
```
uuid=$(cat /proc/sys/kernel/random/uuid) # generate a uuid for the parallelstore data import request-id
gcloud beta parallelstore instances import-data $PARALLELSTORE_NAME \
--location=$PARALLELSTORE_LOCATION \
--source-gcs-bucket-uri=$SOURCE_GCS_PATH \
--destination-parallelstore-path=$DESTINATION_PARALLELSTORE_URI \
--request-id=$uuid \
--async
```
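
Because the import runs with `--async`, you may want to poll the returned operation until it completes, mirroring the wait loop used by the backup CronJob. A sketch (the helper name `wait_for_operation` is ours; it assumes `$PARALLELSTORE_LOCATION` is set):

```
# Hypothetical helper: block until the given long-running Parallelstore
# operation reports done, polling every 30 seconds.
wait_for_operation() {
  local operation="$1"
  while true; do
    local status
    status=$(gcloud beta parallelstore operations describe "$operation" \
      --location="$PARALLELSTORE_LOCATION" \
      --format="value(done)")
    if [ "$status" = "True" ]; then
      break
    fi
    sleep 30
  done
}
```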
ps-to-gcs-backup.yaml:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ps-to-gcs-backup
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  schedule: "0 * * * *"
  successfulJobsHistoryLimit: 3
  suspend: false
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            gke-parallelstore/cpu-limit: "0"
            gke-parallelstore/ephemeral-storage-limit: "0"
            gke-parallelstore/memory-limit: "0"
            gke-parallelstore/volumes: "true"
        spec:
          serviceAccountName: pstore-sa
          containers:
          - name: pstore-backup
            image: google/cloud-sdk:slim
            imagePullPolicy: IfNotPresent
            command:
            - /bin/bash
            - -c
            - |
              #!/bin/bash
              set -ex
              # Retrieve the modification timestamp of the most recently modified directory, to minute precision
              latest_folder_timestamp=$(find $PSTORE_MOUNT_PATH/$SOURCE_PARALLELSTORE_PATH -type d -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2- | xargs -I{} stat -c %x {} | xargs -I{} date -d {} +"%Y-%m-%d %H:%M")
              # Start exporting from Parallelstore to GCS
              operation=$(gcloud beta parallelstore instances export-data $PSTORE_NAME \
                --location=$PSTORE_LOCATION \
                --source-parallelstore-path=$SOURCE_PARALLELSTORE_PATH \
                --destination-gcs-bucket-uri=$DESTINATION_GCS_URI \
                --async \
                --format="value(name)")
              # Wait until the operation completes
              while true; do
                status=$(gcloud beta parallelstore operations describe $operation \
                  --location=$PSTORE_LOCATION \
                  --format="value(done)")
                if [ "$status" == "True" ]; then
                  break
                fi
                sleep 60
              done
              # Check whether the export succeeded
              error=$(gcloud beta parallelstore operations describe $operation \
                --location=$PSTORE_LOCATION \
                --format="value(error)")
              if [ "$error" != "" ]; then
                echo "!!! ERROR while exporting data !!!"
              fi
              # Delete the old files from Parallelstore if requested
              # This will NOT delete the folder with the latest modification timestamp
              if $DELETE_AFTER_BACKUP && [ "$error" == "" ]; then
                find $PSTORE_MOUNT_PATH/$SOURCE_PARALLELSTORE_PATH -mindepth 1 -type d |
                while read dir; do
                  # Only delete folders that were modified earlier than the latest modification timestamp
                  folder_timestamp=$(stat -c %y $dir)
                  if [ $(date -d "$folder_timestamp" +%s) -lt $(date -d "$latest_folder_timestamp" +%s) ]; then
                    echo "Deleting $dir"
                    rm -rf "$dir"
                  fi
                done
              fi
            env:
            - name: PSTORE_MOUNT_PATH # mount path of the Parallelstore instance; should match the volumeMount defined for this container
              value: "/datacached"
            - name: PSTORE_NAME # name of the Parallelstore instance to back up
              value: "chdu-checkpoints-ps"
            - name: PSTORE_LOCATION # location/zone of the Parallelstore instance to back up
              value: "us-central1-a"
            - name: SOURCE_PARALLELSTORE_PATH # absolute path within the Parallelstore instance, WITHOUT the volume mount path
              value: "/nemo-experiments/user-model-workload-ps-64-2025-01-23-18-22-31/checkpoints/"
            - name: DESTINATION_GCS_URI # GCS bucket URI used for storing backups, starting with "gs://"
              value: "gs://chdu-checkpoints-gcs/checkpoints/"
            - name: DELETE_AFTER_BACKUP # delete old data from Parallelstore after a successful backup if "true"
              value: "true"
            volumeMounts:
            - mountPath: /datacached # should match the value of env var PSTORE_MOUNT_PATH
              name: pstore-cached
          dnsPolicy: ClusterFirst
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 30
          volumes:
          - name: pstore-cached
            persistentVolumeClaim:
              claimName: parallelstore-pvc-cached
pstore-sa.yaml:

# Service Account that has access to Parallelstore and GCS
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pstore-sa
  namespace: default
