User guide for persistent GCS w/ redis and kuberay (#49887)
Signed-off-by: Spencer Peterson <[email protected]>
Showing 3 changed files with 157 additions and 0 deletions.
150 changes: 150 additions & 0 deletions
doc/source/cluster/kubernetes/user-guides/kuberay-gcs-persistent-ft.md

@@ -0,0 +1,150 @@
(kuberay-gcs-persistent-ft)=

# Tuning Redis for a Persistent Fault Tolerant GCS

Using Redis to back up the Global Control Store (GCS) with KubeRay provides
fault tolerance in the event that Ray loses the Ray Head. It allows the new Ray
Head to rebuild its state by reading Redis.

However, if Redis loses data, the Ray Head state is also lost.

Therefore, you may want further protection in the event that your Redis cluster
experiences partial or total failure. This guide documents how to configure and
tune Redis for a highly available Ray Cluster with KubeRay.

Tuning your Ray cluster to be highly available safeguards long-running jobs
against unexpected failures and allows you to run Ray on commodity hardware or
preemptible machines.

## Solution overview

KubeRay supports using Redis to persist the GCS, which allows you to move the
point of failure (for data loss) outside Ray. However, you still have to
configure Redis itself to be resilient to failures.

This solution provisions a
[Persistent Volume](https://kubernetes.io/docs/concepts/storage/persistent-volumes/)
backed by hardware storage, which Redis uses to write regular snapshots. If
you lose Redis or its host node, the Redis deployment can be restored from the
snapshot.

While Redis supports clustering, KubeRay only supports standalone (single
replica) Redis, so it omits clustering.

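For context, the RayCluster resource is what points Ray at the external Redis.
The exact fields depend on your KubeRay version, and the sample manifest linked
later in this guide is the authoritative wiring; a rough sketch for KubeRay
v1.1+ might look like this, assuming a Redis service named `redis` listening on
port 6379 in the same namespace:

```
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-external-redis
spec:
  gcsFaultToleranceOptions:
    # Standalone Redis endpoint that persists the GCS.
    redisAddress: "redis:6379"
    # If your Redis requires auth, reference a secret here, for example:
    # redisPassword:
    #   valueFrom:
    #     secretKeyRef:
    #       name: redis-password-secret
    #       key: password
  # headGroupSpec and workerGroupSpecs follow as usual.
```
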
## Persistent storage

Specialty storage volumes (like Google Cloud Storage FUSE or S3) don't support
append operations, which Redis uses to efficiently write its Append Only File
(AOF) log. When using these options, it's recommended to disable AOF.

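In Redis configuration terms, that's a single directive (shown here in
`redis.conf` syntax; how the config reaches Redis depends on how you deploy it):

```
# Rely on periodic RDB snapshots only; skip the append-only log.
appendonly no
```
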
With GCP GKE and Azure AKS, the default storage classes are
[persistent disks](https://cloud.google.com/kubernetes-engine/docs/concepts/persistent-volumes)
and
[SSD Azure disks](https://learn.microsoft.com/en-us/azure/aks/azure-csi-disk-storage-provision)
respectively, and the only configuration needed to provision a disk is as
follows:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: standard-rwo
```
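
How the claim gets attached to Redis depends on your Redis manifest; the sample
YAML linked later in this guide shows KubeRay's exact wiring. As an illustrative
excerpt of a Redis pod template, the claim would typically be mounted at
`/data`, where the official Redis image keeps its working directory:

```
# Illustrative excerpt of a Redis pod spec:
containers:
- name: redis
  image: redis:7.4
  volumeMounts:
  - name: redis-data
    mountPath: /data   # Redis writes its snapshots and AOF under /data by default.
volumes:
- name: redis-data
  persistentVolumeClaim:
    claimName: redis-data   # The PersistentVolumeClaim defined above.
```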

On AWS, you must
[create a storage class](https://docs.aws.amazon.com/eks/latest/userguide/create-storage-class.html)
yourself as well.

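What that class looks like depends on the storage backend you choose; as a
minimal sketch for EBS-backed volumes, assuming the EBS CSI driver is installed
and using an arbitrary class name `ebs-sc` (which would also need to replace
`standard-rwo` in the claim above):

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
```
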
## Tuning backups

Redis supports database dumps at set intervals, which is good for fast recovery
and high performance during normal operation.

Redis also supports journaling at frequent intervals (or continuously), which
can provide stronger durability at the cost of more disk writes (i.e., slower
performance).

A good starting point for backups is to enable both as shown in the following:

```
# Dump a backup every 60s, if there are 1000 writes since the prev. backup.
save 60 1000
dbfilename dump.rdb
# Enable the append-only log file.
appendonly yes
appendfilename "appendonly.aof"
```

In this recommended configuration, Redis creates a full backup every 60 seconds
and syncs the append-only file every second, which is a reasonable balance
between disk space, latency, and data safety.

There are more options for configuring the AOF, with the defaults shown here:

```
# Sync the log to disk every second.
# Alternatives are "no" and "always" (every write).
appendfsync everysec
# Rewrite (compact) the AOF when it doubles in size, but only once it's at least 64mb.
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
```

You can view the full `redis.conf` reference
[here](https://raw.githubusercontent.com/redis/redis/refs/tags/7.4.0/redis.conf).

If your job is generally idempotent and can resume from several minutes of state
loss, you may prefer to disable the append-only log.

If you want your job to lose as little state as possible, set `appendfsync` to
`always` so that Redis syncs every write to the log immediately.

## Putting it together

Edit
[the full YAML](https://github.com/ray-project/kuberay/blob/master/config/samples/ray-cluster.persistent-redis.yaml)
to your satisfaction and apply it:

```
kubectl apply -f config/samples/ray-cluster.persistent-redis.yaml
```
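
The relative path above assumes you're running the command from a checkout of
the KubeRay repository. Alternatively, `kubectl apply` can read the manifest
directly from GitHub; the raw URL below is inferred from the file linked above:

```
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/config/samples/ray-cluster.persistent-redis.yaml
```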

Verify that Kubernetes provisioned a disk and that Redis is running:

```
kubectl get persistentvolumes
kubectl get pods
# You should see redis-0 running.
```

After running a job with some state in GCS, you can delete the Ray head pod as
well as the Redis pod without data loss.

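If you want to peek at the state Redis is holding, one hypothetical check
(assuming the sample's Redis pod is named `redis-0` and doesn't require a
password; pass `-a` if yours does) is to list the keys Ray wrote:

```
kubectl exec -it redis-0 -- redis-cli KEYS '*'
# Expect entries for the Ray cluster's external storage namespace.
```
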
## Verifying

Forward connections to the Ray cluster you just created with the
{ref}`Ray kubectl plugin <kubectl-plugin>`:

```
$ kubectl ray session raycluster-external-redis
```

Then submit any Ray job of your choosing and let it run.
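
For example, with the session from the previous step forwarding the Ray
dashboard, a throwaway submission might look like the following (the default
dashboard port of 8265 and the inline script are both illustrative
assumptions):

```
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
```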

When finished, delete all your pods:

```
$ kubectl delete pods --all
```

Wait for Kubernetes to provision the Ray head and enter a ready state. Then
restart your port forwarding and view the Ray dashboard. You should find that
Ray and Redis have persisted your job's metadata, despite the loss of the Ray
head as well as the Redis replica.
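
As a final check, assuming the dashboard is forwarded on port 8265 again, you
could list jobs against the restored cluster; the job you submitted before
deleting the pods should still appear:

```
ray job list --address http://localhost:8265
# The job submitted earlier should still be listed, served from the restored GCS.
```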