Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UI Feature] Propagate dynamic workflow error messages from flytepropeller to UI #4466

Open
2 tasks done
fg91 opened this issue Nov 21, 2023 · 12 comments
Open
2 tasks done
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working enhancement New feature or request help wanted Extra attention is needed ui Admin console user interface

Comments

@fg91
Copy link
Member

fg91 commented Nov 21, 2023

Motivation: Why do you think this is important?

Our platform users regularly ping us with executions where a dynamic workflow failed to correctly register.

Flyte's UI doesn't provide helpful information to understand that the registration of the dynamic sub workflow has failed and how to fix it.

In this example, the dynamic task pod itself succeeded but the node failed:

image

We in the MLOps team then typically look in the flytepropeller logs to try to understand how the dynamic subworkflow needs to be fixed.

In this case, the propeller logs showed:

issue

Goal: What should the final outcome look like, ideally?

It would be very helpful if FlyteConsole would show the dynamic workflow related errors occurring in flytepropeller so that users can fix them themselves.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@fg91 fg91 added enhancement New feature or request ui Admin console user interface untriaged This issues has not yet been looked at by the Maintainers labels Nov 21, 2023
@eapolinario eapolinario added backlogged For internal use. Reserved for contributor team workflow. exo and removed untriaged This issues has not yet been looked at by the Maintainers labels Nov 30, 2023
@hamersaw
Copy link
Contributor

hamersaw commented Dec 4, 2023

I suspect this is related to #4483. These use a flyteidl error that is wrapped to support the DeepCopy function (here and here), My concern is that the latest update to k8s dependencies changed how the k8s client unmarshals CRDs into json structs.

@hamersaw
Copy link
Contributor

hamersaw commented Dec 4, 2023

@fg91 do you have a minimal repro? Think I found the issues and fixed similarly for branch node error reporting, but want to double check here.

@fg91
Copy link
Member Author

fg91 commented Dec 19, 2023

@fg91 do you have a minimal repro?

Sorry for the late reply @hamersaw . This is a minimal repro:

from flytekit import task, dynamic, workflow


@task
def foo(a = 1):  # Notice the missing type hint
    print(a)


@dynamic
def sub_wf():
    foo(a=1)


@workflow
def wf():
    sub_wf()

The pod succeeds but the dynamic task fails without any error shown in the UI:

Screenshot 2023-12-19 at 16 54 16

Propeller logs show:

E1219 15:53:40.603043       1 workers.go:103] error syncing '.../f96f28b7a6aa64183869': 0: [User] malformed dynamic workflow, caused by: Collected Errors: 1
	Error 0: Code: MismatchingTypes, Node Id: dn0, Description: Variable [a] (type [simple:INTEGER ]) doesn't match expected type [simple:NONE ].

1: 0: [User] malformed dynamic workflow, caused by: Collected Errors: 1
	Error 0: Code: MismatchingTypes, Node Id: dn0, Description: Variable [a] (type [simple:INTEGER ]) doesn't match expected type [simple:NONE ].

@hamersaw
Copy link
Contributor

hamersaw commented Dec 19, 2023

@fg91 thanks, this helps a lot. So I'm seeing this is a UI issue. Everything is stored correctly in FlyteAdmin (error included) but just not displayed in the UI. Running your example here, and then querying admin directly with grpcurl shows:

hamersaw@ragnarok:~/Downloads$ ./grpcurl -plaintext -d '{"id": {"execution_id":{"project":"flytesnacks","domain":"development","name":"f4e6b756dcb254e41851"}, "node_id":"n0"}}'  127.0.0.1:30081 flyteidl.service.AdminService.GetNodeExecution
{
  "id": {
    "nodeId": "n0",
    "executionId": {
      "project": "flytesnacks",
      "domain": "development",
      "name": "f4e6b756dcb254e41851"
    }
  },
  "inputUri": "s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f4e6b756dcb254e41851/n0/data/inputs.pb",
  "closure": {
    "error": {
      "code": "DynamicWorkflowBuildFailed",
      "message": "[User] malformed dynamic workflow, caused by: Collected Errors: 1\n\tError 0: Code: MismatchingTypes, Node Id: dn0, Description: Variable [a] (type [simple:INTEGER ]) doesn't match expected type [simple:NONE ].\n",
      "kind": "USER"
    },
    "phase": "FAILED",
    "startedAt": "2023-12-19T20:48:19.671343654Z",
    "duration": "4.689258075s",
    "createdAt": "2023-12-19T20:48:19.643655673Z",
    "updatedAt": "2023-12-19T20:48:24.360736789Z"
  },
  "metadata": {
    "specNodeId": "n0",
    "isDynamic": true
  }
}

So propeller is correctly reporting, just should be displayed. Will re-direct accordingly.

@hamersaw hamersaw removed the exo label Dec 19, 2023
@fg91
Copy link
Member Author

fg91 commented Dec 20, 2023

Thanks for testing @hamersaw 🙏

I understand that frontend is probably pretty busy already but it would help the UX of dynamic workflows a lot if they could implement this. (Unfortunately I have no clue about frontend things otherwise I'd be happy to tackle it myself 🙈)

Our ML engineers make heavy use of dynamic workflows and we in the ops team get a question like "Why is my Flyte workflow in a weird succeeded yet failed state again?" every few days.

@davidmirror-ops davidmirror-ops added the help wanted Extra attention is needed label Jan 18, 2024
@davidmirror-ops
Copy link
Contributor

01/18/2024 Contributor's meetup notes: Members agreed the UI should surface detailed error messages from propeller to aid users. John will investigate propagating errors appropriately in the UI.

Copy link

dosubot bot commented Jan 18, 2024

🤖

Hey @fg91, great to see you again! Trust you've been coding your way to glory since our last interaction. Let's dive into this new challenge you've got for us.

I understand the issue you're facing and I agree that improving the FlyteConsole to display dynamic workflow related errors would greatly enhance the user experience. Here's a high-level approach to achieve this:

  1. Expose Compile Errors through an API Endpoint: Create an API endpoint in the flytepropeller service that returns the list of compile errors. This endpoint should call the Errors() method of the CompileErrors interface to get the list of errors and return it in the response. You can refer to the flytepropeller/pkg/compiler/errors/error.go for more details on how errors are handled in flytepropeller.

  2. Fetch Compile Errors in FlyteConsole: In the FlyteConsole, make a request to the new API endpoint to fetch the list of compile errors when a dynamic workflow fails to register.

  3. Display Compile Errors in FlyteConsole: Display the error details in the FlyteConsole. The error details can be obtained by calling the Error() method of the CompileError and CompileErrorWithCause types.

Please note that this is a high-level approach and the actual implementation may vary based on the existing architecture and design of the flytepropeller and FlyteConsole services.

Regarding the error handling mechanism in the flytepropeller service, it primarily uses the errors package to create, manage, and check for compile errors. You can refer to the flytepropeller/pkg/compiler/workflow_compiler_test.go for more details on how errors are handled, especially for dynamic workflows.

I hope this helps! If you have any further questions or need more clarification, feel free to ask.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@fgerzer
Copy link

fgerzer commented Feb 7, 2024

It would've been very helpful for me too to have this exposed; I just spent an hour trying to understand how my dynamic both failed and succeeded at the same time.

@fg91
Copy link
Member Author

fg91 commented Jun 12, 2024

01/18/2024 Contributor's meetup notes: Members agreed the UI should surface detailed error messages from propeller to aid users. John will investigate propagating errors appropriately in the UI.

@davidmirror-ops please excuse bringing this up again but is there any chance the frontend team could tackle this in one of the upcomming sprints? These types of errors in dynamic workflows are almost impossible to debug for users without the help of a platform engineer because you need to know that you have to search for the execution id in the propeller logs and our users struggle with this again and again 🙈

@jdanbrown
Copy link

+1 I'm staring at a similarly baffling error msg, and I'm so far trying and failing to guess my way through my team's code to see which of our many @dynamic's (and @task's?) might have a signature that looks like this. I haven't found it yet, so I'm currently pretty confused and trying to google instead, which is how I ended up here.

0: [User] malformed dynamic workflow, caused by: Collected Errors: 5
	Error 0: Code: MismatchingTypes, Node Id: dn10, Description: Variable [train_outputs] (type [simple:STRUCT]) doesn't match expected type [simple:FLOAT].
	Error 1: Code: MismatchingTypes, Node Id: dn7, Description: Variable [train_outputs] (type [simple:STRUCT]) doesn't match expected type [simple:STRING].
	Error 2: Code: ParameterNotBound, Node Id: dn10, Description: Parameter not bound [model_loss].
	Error 3: Code: ParameterNotBound, Node Id: dn10, Description: Parameter not bound [train_time_sec].
	Error 4: Code: ParameterNotBound, Node Id: dn7, Description: Parameter not bound [model_path].

@shashank-chaudhry-ai
Copy link

+1 I recently hit this issue as well. The issue was due to a type mismatch:

Error 0: Code: MismatchingTypes, Node Id: dn147-dn0, Description: Variable [o0] (type [blob:{}]) doesn't match expected type [simple:STRING].
        Error 1: Code: ParameterNotBound, Node Id: dn147-dn0, Description: Parameter not bound [frames_tar]

FlytePropeller logs did show the issue, but it didn't propagate to the UI. This change would make debugging dynamic workflow issues a lot easier.

@fg91 fg91 added the bug Something isn't working label Aug 15, 2024
@jpvotta
Copy link

jpvotta commented Aug 15, 2024

Update: we had a call today to refresh on this issue. Here is a quick summary:

  • To repro, need to downgrade flytekit since our improved error handling work catches the compilation error at registration time (good work 🙂). Still worth fixing because of course there can be other compilation issues with dynamic.
  • We talked about where to surface the error, and agreed that it should be surfaced at the node level (i.e. the dynamic task succeeded, but the dynamic workflow couldn't be executed, so the node failed).
  • We agreed that the most important thing was surfacing the dynamic compilation error somewhere in that right slider, ideally associated with the node. Union product/design are going to come up with a UI design here.
  • A secondary objective is to avoid showing users a task that says "succeeded" and a node that says "failed", since this is super confusing for non-power users. We have heard some users call it "Schrödinger's Task", which isn't the best look 😕.
  • Finally, we discussed the ultimate goal, which is to surface a class of errors which might be called "edge" errors (as opposed to "node" errors) - the task ran, but something happened in the system between the succeeded task run and the next node of the workflow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working enhancement New feature or request help wanted Extra attention is needed ui Admin console user interface
Projects
None yet
Development

No branches or pull requests

8 participants