
[Bug] Different tasks should create different ray clusters #82

Closed
tatiana opened this issue Oct 9, 2024 · 3 comments
Labels: bug, customer

tatiana (Collaborator) commented Oct 9, 2024

Currently, as of 0.2.1, when running a DAG with multiple @task.ray tasks, all of the tasks reuse the same Ray cluster. Each task should instead create its own separate, independent cluster.
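
For illustration, a minimal sketch of a DAG that hits this (the config keys, connection ID, and values are assumptions made for the example, not taken from this issue):

# Sketch of the reported behaviour: two @task.ray tasks in one DAG.
# The config dictionary (conn_id, runtime_env, num_cpus) is an illustrative assumption.
from datetime import datetime

from airflow.decorators import dag, task

RAY_CONFIG = {
    "conn_id": "ray_conn",              # hypothetical Airflow connection
    "runtime_env": {"pip": ["numpy"]},  # hypothetical runtime environment
    "num_cpus": 1,
}

@dag(schedule=None, start_date=datetime(2024, 10, 1), catchup=False)
def ray_multi_task():
    @task.ray(config=RAY_CONFIG)
    def step_one() -> int:
        return 1

    @task.ray(config=RAY_CONFIG)
    def step_two(value: int) -> int:
        return value + 1

    # As of 0.2.1 both tasks end up reusing one Ray cluster;
    # the expected behaviour is a separate cluster per task.
    step_two(step_one())

ray_multi_task()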

tatiana added the bug, triage-needed, and customer labels on Oct 9, 2024
DavidSacerdote commented Oct 15, 2024

I think this might be driven by running Ray in only a single namespace, called "ray".

This means that the second Ray instance tries to access everything exactly as the first one does, including heartbeat, failover, etc., so it cannot behave as a separate cluster.

Putting the two Ray instances into their own namespaces would solve this, but it presents a second set of issues: we need resources in addition to Ray to be running in each namespace, such as a PersistentVolumeClaim like this one:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-scratch-new
  namespace: ray
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 6000Gi
  storageClassName: fsx
  volumeMode: Filesystem
  volumeName: data-scratch-new

With a single namespace, I can apply this YAML separately and then have the Ray cluster config pick up the PVC and use it. If each Ray instance is in its own namespace, we will need a way to inject a series of configurations like this one into each namespace so that Ray has full access to the necessary resources.
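
To make that concrete, a hypothetical sketch of injecting the same PVC into each per-cluster namespace (the namespace names, manifest path, and helper function are mine, not part of the provider):

# Hypothetical sketch: re-render the PVC manifest for each Ray namespace and apply it.
# Requires PyYAML and the kubernetes Python client; all names and paths are assumptions.
import yaml
from kubernetes import client, config

def apply_pvc(namespace: str, manifest_path: str = "pvc-data-scratch.yaml") -> None:
    with open(manifest_path) as f:
        pvc = yaml.safe_load(f)
    pvc["metadata"]["namespace"] = namespace  # retarget the claim at this cluster's namespace

    config.load_kube_config()
    core = client.CoreV1Api()
    core.create_namespaced_persistent_volume_claim(namespace=namespace, body=pvc)

# One namespace per Ray cluster instead of the single shared "ray" namespace.
for ns in ["ray-task-a", "ray-task-b"]:
    apply_pvc(ns)

Note that a PersistentVolume binds to a single claim, so each namespace would likely need its own volumeName/PV rather than all reusing data-scratch-new.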

I'll also note that the details of how we do this are going to be AZ-specific, with each availability zone needing to mount its own local resources (and never mounting them in the wrong one).

tatiana added this to the 0.3.0 milestone on Oct 29, 2024
pankajastro (Collaborator) commented:

Hey @DavidSacerdote

The operator creates a cluster based on the given cluster specification, which includes the cluster name. An example of the cluster specification can be found here: https://github.com/astronomer/astro-provider-ray/blob/main/dev/dags/scripts/ray.yaml.

As mentioned in your comment above ("I think this might be driven by running Ray only in a single namespace, called 'ray'", #82 (comment)), the cluster name is hardcoded in the specification, so we can't have two clusters with the same name: cluster names must be unique within a namespace.

Did you consider testing with a unique specification per task to ensure the cluster name remains unique?
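
As a sketch of that suggestion (the templating helper and file paths are hypothetical, not something the provider ships):

# Hypothetical sketch: derive a per-task spec from the base ray.yaml so each task's
# RayCluster gets a unique metadata.name within the namespace.
import tempfile
import yaml

def per_task_cluster_spec(base_spec_path: str, task_id: str) -> str:
    with open(base_spec_path) as f:
        spec = yaml.safe_load(f)
    spec["metadata"]["name"] = f"{spec['metadata']['name']}-{task_id}"  # unique per task
    out = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False)
    yaml.safe_dump(spec, out)
    out.close()
    return out.name  # point each task's cluster config at this file

spec_a = per_task_cluster_spec("dev/dags/scripts/ray.yaml", "task_a")
spec_b = per_task_cluster_spec("dev/dags/scripts/ray.yaml", "task_b")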

We also have the update_if_exists parameter (https://github.com/astronomer/astro-provider-ray/blob/main/ray_provider/operators/ray.py#L148) in the operator to control whether an existing cluster is reused; the default value for this parameter is True.
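
For example (a sketch only; the operator and parameter names are assumed from the linked operators/ray.py, and the connection ID and spec path are placeholders):

# Sketch: set up a cluster from a per-task spec and opt out of reusing an existing one.
# SetupRayCluster and its parameters are assumed from ray_provider/operators/ray.py.
from ray_provider.operators.ray import SetupRayCluster

setup_cluster = SetupRayCluster(
    task_id="setup_ray_cluster_task_a",
    conn_id="ray_conn",                           # placeholder Airflow connection
    ray_cluster_yaml="scripts/ray-task-a.yaml",   # per-task spec with a unique cluster name
    update_if_exists=False,                       # do not reuse/update an existing cluster
)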

Recently, PR #67 attempted to improve this with dynamic cluster configuration, and issue #81 tracks the topic. Do you think that would help here, and is it what you are looking for?

tatiana (Collaborator, Author) commented Nov 27, 2024

This is being fixed as part of #81 and was tested using the branch issue-81-mess.

tatiana closed this as completed on Nov 27, 2024