
[Bug] Different tasks should create different ray clusters #82

Closed
tatiana opened this issue Oct 9, 2024 · 3 comments
Labels: bug, customer

tatiana (Collaborator) commented Oct 9, 2024

Currently, as of 0.2.1, when running a DAG with multiple @task.ray tasks, all of the tasks reuse the same Ray cluster. Each task should instead create its own separate, independent cluster.
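
For illustration, a minimal sketch of a DAG that hits this (the config keys, connection ID, and values are assumptions made for the example, not taken from this issue):

# Sketch of the reported behaviour: two @task.ray tasks in one DAG.
# The config dictionary (conn_id, runtime_env, num_cpus) is an illustrative assumption.
from datetime import datetime

from airflow.decorators import dag, task

RAY_CONFIG = {
    "conn_id": "ray_conn",              # hypothetical Airflow connection
    "runtime_env": {"pip": ["numpy"]},  # hypothetical runtime environment
    "num_cpus": 1,
}

@dag(schedule=None, start_date=datetime(2024, 10, 1), catchup=False)
def ray_multi_task():
    @task.ray(config=RAY_CONFIG)
    def step_one() -> int:
        return 1

    @task.ray(config=RAY_CONFIG)
    def step_two(value: int) -> int:
        return value + 1

    # As of 0.2.1 both tasks end up reusing one Ray cluster;
    # the expected behaviour is a separate cluster per task.
    step_two(step_one())

ray_multi_task()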

tatiana added the bug, triage-needed, and customer labels on Oct 9, 2024
DavidSacerdote commented Oct 15, 2024

I think this might be driven by running Ray in only a single namespace, called "ray".

This means that the second Ray instance tries to access everything exactly as the first one does, including heartbeat, failover, etc., so it cannot behave as a separate cluster.

Putting the two Ray instances into their own namespaces would solve this, but it presents a second set of issues: we need resources in addition to Ray to be running in each namespace, such as a PersistentVolumeClaim like this one:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-scratch-new
  namespace: ray
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 6000Gi
  storageClassName: fsx
  volumeMode: Filesystem
  volumeName: data-scratch-new

With a single namespace, I can apply this YAML separately and then have the Ray cluster config pick up the PVC and use it. If each Ray instance is in its own namespace, we will need a way to inject a series of configurations like this one into each namespace so that Ray has full access to the necessary resources.
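
To make that concrete, a hypothetical sketch of injecting the same PVC into each per-cluster namespace (the namespace names, manifest path, and helper function are mine, not part of the provider):

# Hypothetical sketch: re-render the PVC manifest for each Ray namespace and apply it.
# Requires PyYAML and the kubernetes Python client; all names and paths are assumptions.
import yaml
from kubernetes import client, config

def apply_pvc(namespace: str, manifest_path: str = "pvc-data-scratch.yaml") -> None:
    with open(manifest_path) as f:
        pvc = yaml.safe_load(f)
    pvc["metadata"]["namespace"] = namespace  # retarget the claim at this cluster's namespace

    config.load_kube_config()
    core = client.CoreV1Api()
    core.create_namespaced_persistent_volume_claim(namespace=namespace, body=pvc)

# One namespace per Ray cluster instead of the single shared "ray" namespace.
for ns in ["ray-task-a", "ray-task-b"]:
    apply_pvc(ns)

Note that a PersistentVolume binds to a single claim, so each namespace would likely need its own volumeName/PV rather than all reusing data-scratch-new.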

I'll also note that the details of how we do this are going to be AZ-specific, with each availability zone needing to mount its own local resources (and never mounting them in the wrong one).

tatiana added this to the 0.3.0 milestone on Oct 29, 2024
pankajastro (Collaborator) commented:

Hey @DavidSacerdote

The operator creates a cluster based on the given cluster specification, which includes the cluster name. An example of the cluster specification can be found here: https://github.com/astronomer/astro-provider-ray/blob/main/dev/dags/scripts/ray.yaml.

As mentioned in your comment above ("I think this might be driven by running Ray only in a single namespace, called 'ray'", #82 (comment)), the cluster name is hardcoded in the specification, so we can't have two clusters with the same name: cluster names must be unique within a namespace.

Did you consider testing with a unique specification per task to ensure the cluster name remains unique?
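
As a sketch of that suggestion (the templating helper and file paths are hypothetical, not something the provider ships):

# Hypothetical sketch: derive a per-task spec from the base ray.yaml so each task's
# RayCluster gets a unique metadata.name within the namespace.
import tempfile
import yaml

def per_task_cluster_spec(base_spec_path: str, task_id: str) -> str:
    with open(base_spec_path) as f:
        spec = yaml.safe_load(f)
    spec["metadata"]["name"] = f"{spec['metadata']['name']}-{task_id}"  # unique per task
    out = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False)
    yaml.safe_dump(spec, out)
    out.close()
    return out.name  # point each task's cluster config at this file

spec_a = per_task_cluster_spec("dev/dags/scripts/ray.yaml", "task_a")
spec_b = per_task_cluster_spec("dev/dags/scripts/ray.yaml", "task_b")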

We also have the update_if_exists parameter (https://github.com/astronomer/astro-provider-ray/blob/main/ray_provider/operators/ray.py#L148) in the operator to control whether an existing cluster is reused; the default value for this parameter is True.
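
For example (a sketch only; the operator and parameter names are assumed from the linked operators/ray.py, and the connection ID and spec path are placeholders):

# Sketch: set up a cluster from a per-task spec and opt out of reusing an existing one.
# SetupRayCluster and its parameters are assumed from ray_provider/operators/ray.py.
from ray_provider.operators.ray import SetupRayCluster

setup_cluster = SetupRayCluster(
    task_id="setup_ray_cluster_task_a",
    conn_id="ray_conn",                           # placeholder Airflow connection
    ray_cluster_yaml="scripts/ray-task-a.yaml",   # per-task spec with a unique cluster name
    update_if_exists=False,                       # do not reuse/update an existing cluster
)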

Recently, PR #67 attempted to improve this with dynamic cluster configuration, and issue #81 tracks the topic. Do you think that would help here, and is it what you are looking for?

tatiana (Collaborator, Author) commented Nov 27, 2024

This is being fixed as part of #81 and was tested using the branch issue-81-mess.

tatiana closed this as completed on Nov 27, 2024