
[Feature] Support dynamic configuration based on Airflow context #81

Closed
tatiana opened this issue Oct 9, 2024 · 2 comments · Fixed by #103
Labels: customer, enhancement (New feature or request)

tatiana commented Oct 9, 2024

The Ray provider 0.2.1 allowed users to define a hard-coded configuration for materializing the Kubernetes cluster. The goal of this ticket is to let users define a function that receives the Airflow context and generates the configuration dynamically from context properties.

Significant progress was previously made in PR #67.

We should:

  • Create a representative example DAG for this feature
  • Confirm the implementation works as expected
  • Add tests covering this behaviour
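
A rough sketch of what this could look like from the DAG author's side is shown below. The callable-based `config` argument is the feature being requested; the exact parameter name, connection id, and config keys are illustrative assumptions, not the final API (that is whatever lands with the linked PR).

```python
# Illustrative sketch only: parameter names, connection id, and config keys are assumptions.
from datetime import datetime

from airflow.decorators import dag
from ray_provider.decorators import ray  # flattened import path (see the table later in this thread)


def build_ray_config(context: dict) -> dict:
    """Generate the Ray cluster configuration dynamically from the Airflow context."""
    return {
        "conn_id": "ray_conn",  # hypothetical connection id
        "runtime_env": {"working_dir": "/usr/local/airflow/dags/ray_scripts"},
        "num_cpus": 1,
        # Context properties (e.g. the logical date) drive the per-run configuration.
        "env_vars": {"RUN_DATE": context["ds"]},
    }


@dag(start_date=datetime(2024, 10, 1), schedule=None, catchup=False)
def dynamic_ray_config_example():
    @ray.task(config=build_ray_config)
    def square(value: int) -> int:
        return value * value

    square(2)


dynamic_ray_config_example()
```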
@phanikumv

Related to #67

tatiana added a commit that referenced this issue Nov 27, 2024
This PR flattens the provider folder structure to make it easier to troubleshoot DAGs created with this provider and to improve the provider's developer experience. It contains **breaking changes**, which is acceptable since the provider has not yet reached a stable state (we are still pre-1.0).

As part of this PR, the import paths of the existing decorators, hooks, operators, and triggers have changed, as documented in the table below:


| Type      | Previous import path                        | Current import path                     |
|-----------|---------------------------------------------|-----------------------------------------|
| Decorator | ray_provider.decorators.ray.ray             | ray_provider.decorators.ray             |
| Hook      | ray_provider.hooks.ray.RayHook              | ray_provider.hooks.RayHook              |
| Operator  | ray_provider.operators.ray.DeleteRayCluster | ray_provider.operators.DeleteRayCluster |
| Operator  | ray_provider.operators.ray.SetupRayCluster  | ray_provider.operators.SetupRayCluster  |
| Operator  | ray_provider.operators.ray.SubmitRayJob     | ray_provider.operators.SubmitRayJob     |
| Trigger   | ray_provider.triggers.ray.RayJobTrigger     | ray_provider.triggers.RayJobTrigger     |
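
In code, the new flattened imports from the table above look like this:

```python
# New import paths after flattening (taken from the table above).
from ray_provider.decorators import ray
from ray_provider.hooks import RayHook
from ray_provider.operators import DeleteRayCluster, SetupRayCluster, SubmitRayJob
from ray_provider.triggers import RayJobTrigger
```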


Previously, there were four modules within `ray_provider` named `ray.py`. Not only did this make the project structure more nested and complex than it needed to be, hurting maintainability with no benefit, but it also caused many problems when troubleshooting the provider itself.

The following logs illustrate the problem from a debugging perspective:
![Screenshot 2024-11-26 at 15 10 22](https://github.com/user-attachments/assets/1cd87019-af9f-4c36-942e-e75983b2fada)

Different log lines mentioning `ray.py` referenced different modules. It was time-consuming and frustrating for developers to work out which `ray.py` the logs referred to: was it `hooks/ray.py`, `operators/ray.py`, `triggers/ray.py`, or `decorators/ray.py`? This ambiguity significantly delayed the development of #81. We aim to reduce development time by introducing this change.
tatiana added a commit that referenced this issue Nov 27, 2024
Fix the local development environment using Astro CLI and a Kind Kubernetes cluster, and update the documentation.

While implementing #81, I faced several issues in the local development environment. Unfortunately, the existing documentation and configuration did not allow developers to run the example DAGs locally.

One of the main issues was that Airflow (running in Docker via Astro CLI) could not connect to Kind properly. Once that was solved, another critical problem was that Airflow could not access the Ray clusters created in the Kind Kubernetes cluster.

Some of the issues faced include:
* Inconsistent connection naming
* Missing Kind configuration
* Missing Docker overrides for Astro CLI
* macOS Docker/Kind networking specifics

After applying all these changes, I was able to successfully run all the example DAGs locally:
![Screenshot 2024-11-26 at 14 42 21](https://github.com/user-attachments/assets/4c35e3e9-604b-4458-9107-f4945ffa2a67)

As illustrated below:

![Screenshot 2024-11-26 at 14 42 34](https://github.com/user-attachments/assets/f2a8a700-7d0a-43c6-a5f1-ea74d6f8b54b)

![Screenshot 2024-11-26 at 14 42 47](https://github.com/user-attachments/assets/cfe0324f-1325-4d2a-9ed5-68d8f7f61d63)

![Screenshot 2024-11-26 at 14 42 58](https://github.com/user-attachments/assets/23f1dfa5-1c7a-46b6-9096-34459bdc5047)

![Screenshot 2024-11-26 at 14 43 08](https://github.com/user-attachments/assets/6d407da4-68d9-41dc-8be7-0ddd62844308)

With this PR, I hope to save other Ray provider developers time.
tatiana added a commit that referenced this issue Nov 29, 2024
By catching `Exception`, we risk hitting an unexpected exception that the program can't recover from, or worse, swallowing an important exception without properly logging it - a massive headache when trying to debug programs that are failing in weird ways.

If this change raises any exceptions that should be caught, we'll have the opportunity to understand which exceptions to capture and handle them gracefully.

This was identified during the development of #81.
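
As an illustration of the pattern (a sketch under assumed names, not the provider's actual code), the idea is to catch only the exception types we know how to handle and let anything unexpected propagate:

```python
import logging

from kubernetes.client.exceptions import ApiException

logger = logging.getLogger(__name__)


def delete_ray_cluster(custom_objects_api, name: str, namespace: str) -> None:
    """Delete a RayCluster custom resource, handling only known failure modes.

    The function name and the group/version/plural values are illustrative.
    """
    try:
        custom_objects_api.delete_namespaced_custom_object(
            group="ray.io",
            version="v1",
            namespace=namespace,
            plural="rayclusters",
            name=name,
        )
    except ApiException as exc:
        # A known, recoverable failure mode: log it and re-raise so the caller can decide.
        logger.error("Failed to delete Ray cluster %r: %s", name, exc)
        raise
    # Deliberately no bare `except Exception`: unexpected errors surface with their full traceback.
```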
tatiana closed this as completed Nov 29, 2024