[Core Feature] Failure-Node support #1506
Comments
cc @EngHabu / @cosmicBboy / @kanterov |
As old as this issue may be, we would absolutely LOVE this and it has been a stable feature on other orchestration engines (such as Kubeflow) for many years. |
@kumare3 is this a flytekit-only change, or would it require changes to propeller to propagate error state? |
@dylanwilder the changes in the backend are mostly already done. It's possible they have regressed because of the lack of end-to-end testing (since the feature isn't implemented in flytekit). Would you be able to help with the flytekit side of things? |
Potentially we could pitch in since we'd like to see this, do you have anything outlining what's required? |
@dylanwilder I think this is really close. Maybe attempt to use it in an example and start the debugging journey from there? Happy to be pulled in once you get flytekit to produce the spec in case you deem it a problem with the backend... |
Thanks will take a look and see! |
@eapolinario is probably also looking into this |
Wondering if there is a way to support inputs and outputs in the failure handler that differ from the workflow interface. Below is an example use case we were trying to implement:
|
Was just brainstorming with @pingsutw now on this... here are my thoughts on UX:

    @workflow
    def my_wf(a: int) -> str:
        b = my_task(a=a)
        flytekit.current_context().on_failure = clean_up(a=a, b=b)
        return b

    @task
    def clean_up(err: Error, a: Optional[int], b: Optional[str]) -> str:
        ...
|
Need some discussion about
PRs for failure node. (still WIP) also cc @cosmicBboy @wild-endeavor |
|
is this okay to close? |
Motivation: Why do you think this is important?
The Flyte backend supports a failure node for every workflow and sub-workflow. This is not currently exposed in flytekit (Python or Java).
Goal: What should the final outcome look like, ideally?
Users should be able to define failure nodes for their workflows. An example for the python SDK is as follows
If my_wf() fails at any point during execution, it will call the my_error_handler() task and pass it some context (error info, etc.) so that it can handle the error. The expectation is that my_error_handler() would do things like clean up resources or log/send customized notifications. What it will NOT let you do is recover from the failure: the execution of this workflow will still fail and be marked as failed, and upstream callers will still be notified of its failure.
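To make the intended semantics concrete, here is a minimal plain-Python sketch of "run the handler, then still fail". It is not flytekit code: the decorator name `workflow_with_failure_handler` and the handler signature are hypothetical, chosen only to illustrate the behaviour described above.

```python
import functools
from typing import Any, Callable, List


def workflow_with_failure_handler(handler: Callable[[Exception], None]):
    """Hypothetical decorator (NOT a flytekit API): run `handler` on failure, then re-raise."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return fn(*args, **kwargs)
            except Exception as err:
                # The failure node gets the error context for cleanup/notification...
                handler(err)
                # ...but it cannot recover: the original failure still propagates.
                raise

        return wrapper

    return decorator


handled: List[str] = []  # stands in for "cleaned-up resources / sent notifications"


def my_error_handler(err: Exception) -> None:
    handled.append(f"cleaned up after: {err}")


@workflow_with_failure_handler(my_error_handler)
def my_wf(a: int) -> str:
    if a < 0:
        raise ValueError("negative input")
    return str(a)
```

Calling `my_wf(-1)` in this sketch runs `my_error_handler` once and then still raises `ValueError`, which mirrors the "handler runs, execution is still marked as failed" requirement.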
An example of sub-workflows:
In this case, my_parent_wf will continue running even if any of its nodes fails. The overall status of the execution will again be marked as failed, but as many nodes as possible will be allowed to execute. Whenever my_sub_wf fails, it will invoke an instance of the my_error_handler task to clean up resources, etc.
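The sub-workflow behaviour can be sketched the same way in plain Python. This is only an illustration under assumptions: `run_parent`, the node names `n0`/`n2`, and the handler signature are hypothetical, and the nodes are assumed to be independent of each other (nodes downstream of a failed node are not modelled here).

```python
from typing import Callable, Dict, List, Tuple

cleanup_log: List[str] = []  # records what the failure handler was asked to clean up


def my_error_handler(node: str, err: Exception) -> None:
    cleanup_log.append(f"{node}: {err}")


def failing_sub_wf() -> None:
    raise RuntimeError("sub-workflow failed")


def run_parent(
    nodes: Dict[str, Callable[[], None]],
    handler: Callable[[str, Exception], None],
) -> Tuple[str, List[str]]:
    """Hypothetical runner: keep executing remaining nodes, but report FAILED overall."""
    completed: List[str] = []
    failed = False
    for name, node in nodes.items():
        try:
            node()
            completed.append(name)
        except Exception as err:
            handler(name, err)  # the failed node's handler performs cleanup
            failed = True       # the failure is recorded, not swallowed
    return ("FAILED" if failed else "SUCCEEDED"), completed


status, completed = run_parent(
    {"n0": lambda: None, "my_sub_wf": failing_sub_wf, "n2": lambda: None},
    my_error_handler,
)
```

Here `n2` still runs after `my_sub_wf` fails, the handler is invoked once for the failed node, and the overall status is `"FAILED"`.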
Describe alternatives you've considered
NA
[Optional] Propose: Link/Inline OR Additional context
More discussion in
#1012
Related flytekit java issue - #1012