Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve server-client communication error handling #6578

Merged
merged 7 commits into from
Feb 25, 2025

Conversation

MetRonnie
Copy link
Member

@MetRonnie MetRonnie commented Jan 28, 2025

In #6499 we added a new field to the trigger GraphQL mutation, and this unfortunately broke the ability for cylc trigger at 8.4 to work on workflows running in 8.3 (I shall term this "inter-version server-client comms").

This PR does not fix that (we don't think a fix is feasible at this stage).

An additional problem is that the error handling for server-client comms is broken - that is what this PR fixes. And it attempts to ensure we have defined format for server-client comms which should reduce the chance of breaking inter-version server-client comms in future.

Before

# Workflow running at Cylc 8.3.6, CLI at Cylc 8.4.0
$ cylc trigger wflow//1/foo
Error processing command:
    AttributeError: 'list' object has no attribute 'values'

After

# Workflow running at Cylc 8.3.6, CLI at this Cylc <this PR>
$ cylc trigger wflow//1/foo
Error processing command:
    Exception: This command is no longer compatible with the version of Cylc running the workflow. Please stop & restart the workflow with Cylc 8.4.1.dev or higher.

[{'error': {'message': 'Unknown argument "onResume" on field "trigger" of type "Mutations".', 'traceback': ['graphql.error.base.GraphQLError: Unknown argument "onResume" on field "trigger" of type "Mutations".\n']}}]

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • No dependency changes
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • No docs needed
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@MetRonnie MetRonnie added the bug Something is wrong :( label Jan 28, 2025
@MetRonnie MetRonnie added this to the 8.4.1 milestone Jan 28, 2025
@MetRonnie MetRonnie self-assigned this Jan 28, 2025
@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from 5dd62b7 to ee3102d Compare January 28, 2025 17:19
@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from ee3102d to ff80511 Compare January 29, 2025 16:24
client = WorkflowRuntimeClient(one.workflow)
with monkeypatch.context() as mp:
mp.setattr(client, 'socket', Mock(recv=mock_recv))
mp.setattr(client, 'poller', Mock())
Copy link
Member Author

@MetRonnie MetRonnie Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future/historical reference, I originally tried mocking the poll method of client.poller, i.e.

mp.setattr(client.poller, 'poll', Mock())

but this resulted in pytest hanging near the end of running all integration tests; some strange interaction due to the ZMQ poller using threading or something, dunno.

Anyway, the reason for this monkeypatching is to get client.poller.poll() to return a truthy value so the socket can receive the mock response

# receive response
if self.poller.poll(timeout):
res = await self.socket.recv()

Copy link
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made two suggestions, but neither should block this PR. :)

@MetRonnie MetRonnie requested a review from wxtim February 11, 2025 15:49
Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand what bug this fixes.

The changes I can see:

  • messages/traceback field removed
  • cylc_version field added
  • Errors now force exceptions in multi mode (was this the bug?).
  • Refactoring.
  • Nicer error for the trigger incompatibility issue.

@MetRonnie
Copy link
Member Author

MetRonnie commented Feb 20, 2025

Can you help me understand what bug this fixes

Before

# Workflow running at Cylc 8.3.6, CLI at Cylc 8.4.0
$ cylc trigger wflow//1/foo
Error processing command:
    AttributeError: 'list' object has no attribute 'values'

After

# Workflow running at Cylc 8.3.6, CLI at this Cylc <this PR>
$ cylc trigger wflow//1/foo
Error processing command:
    Exception: This command is no longer compatible with the version of Cylc running the workflow. Please stop & restart the workflow with Cylc 8.4.1.dev or higher.

[{'error': {'message': 'Unknown argument "onResume" on field "trigger" of type "Mutations".', 'traceback': ['graphql.error.base.GraphQLError: Unknown argument "onResume" on field "trigger" of type "Mutations".\n']}}]

(Added to OP)

@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from 0fac2c7 to f751949 Compare February 20, 2025 16:31
@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from b81932c to 9505212 Compare February 21, 2025 12:58
@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from 9505212 to 43cab17 Compare February 21, 2025 15:44
@oliver-sanders oliver-sanders self-requested a review February 24, 2025 11:53
@oliver-sanders
Copy link
Member

oliver-sanders commented Feb 24, 2025

I think this PR is stripping away exception context?

I tried creating an internal error with this diff:

diff --git a/cylc/flow/network/resolvers.py b/cylc/flow/network/resolvers.py
index 79b91f97e..702b3f573 100644
--- a/cylc/flow/network/resolvers.py
+++ b/cylc/flow/network/resolvers.py
@@ -787,6 +787,7 @@ class Resolvers(BaseResolvers):
         cutoff: Any = None
     ):
         """Put or clear broadcasts."""
+        raise ValueError('Unsupported broadcast mode')
         if settings is not None:
             # Convert schema field names to workflow config setting names if
             # applicable:

Results:

# scheduler=8.4.0, client=*
ERROR: [{'error': {'message': 'Unsupported broadcast mode', 'traceback': ['graphql.error.located_error.GraphQLLocatedError: Unsupported broadcast mode\n']}}]

# scheduler=this branch, client=*
ERROR: {'broadcast': None}

@MetRonnie MetRonnie force-pushed the graphql-err-handling branch from aa12785 to 1415531 Compare February 24, 2025 15:22
@MetRonnie
Copy link
Member Author

Should be sorted in 1415531

}
but this is not 100% consistent unfortunately
"""
error: Union[Exception, str, dict]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is error an Exception or a str?

Copy link
Member Author

@MetRonnie MetRonnie Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I tried to change it to Exception | str to begin with, to simplify things, but realised it breaks inter-version compatibility. So presently I don't think it ever is an Exception or str. However, with these changes we should be able to change it to be Exception | str in future without breaking inter-version compatibility with this version onwards.

@oliver-sanders
Copy link
Member

oliver-sanders commented Feb 25, 2025

Using a diff along the lines of this:

diff --git a/cylc/flow/network/resolvers.py b/cylc/flow/network/resolvers.py
index 79b91f97e..702b3f573 100644
--- a/cylc/flow/network/resolvers.py
+++ b/cylc/flow/network/resolvers.py
@@ -787,6 +787,7 @@ class Resolvers(BaseResolvers):
         cutoff: Any = None
     ):
         """Put or clear broadcasts."""
+        raise ValueError('Unsupported broadcast mode')
         if settings is not None:
             # Convert schema field names to workflow config setting names if
             # applicable:

We will now get an error something like this:

ClientError: {'message': 'Unsupported broadcast mode'}
  • We are reporting the string ClientError for an error that was received from the server?! Doesn't really make sense.
  • We are presenting the error as a dictionary (this is the same as the previous behaviour, but can be easily improved).

This patch changes the string ClientError to RequestError and uses the message field from the dictionary rather than the whole dictionary:

diff --git a/cylc/flow/exceptions.py b/cylc/flow/exceptions.py
index 802cfaaa9..a3eff7388 100644
--- a/cylc/flow/exceptions.py
+++ b/cylc/flow/exceptions.py
@@ -285,6 +285,10 @@ class ClientError(CylcError):
         return ret

+class RequestError(ClientError):
+    """Represents an error returned by the server."""
+
+
 class WorkflowStopped(ClientError):
     """The Cylc scheduler you attempted to connect to is stopped."""
diff --git a/cylc/flow/network/client.py b/cylc/flow/network/client.py
index 656aefd5c..89dac4cbb 100644
--- a/cylc/flow/network/client.py
+++ b/cylc/flow/network/client.py
@@ -40,10 +40,10 @@ from cylc.flow import (
     __version__ as CYLC_VERSION,
 )
 from cylc.flow.exceptions import (
-    ClientError,
     ClientTimeout,
     ContactFileExists,
     CylcError,
+    RequestError,
     WorkflowStopped,
 )
 from cylc.flow.hostuserutil import get_fqdn_by_host
@@ -332,17 +332,23 @@ class WorkflowRuntimeClient(  # type: ignore[misc]
             return response['data']
         except KeyError:
             error = response.get('error')
-            if not error:
-                error = (
+            error_mesage: str
+            if isinstance(error, (str, Exception)):
+                error_mesage = str(error)
+            elif error is None:
+                error_mesage = (
                     f"Received invalid response for Cylc {CYLC_VERSION}: "
                     f"{response}"
                 )
                 wflow_cylc_ver = response.get('cylc_version')
                 if wflow_cylc_ver and wflow_cylc_ver != CYLC_VERSION:
-                    error += (
+                    error_mesage += (
                         f"\n(Workflow is running in Cylc {wflow_cylc_ver})"
                     )
-            raise ClientError(str(error)) from None
+            else:
+                error_mesage = error.get('message', 'ERROR')
+
+            raise RequestError(error_mesage) from None
 
     def get_header(self) -> dict:
         """Return "header" data to attach to each request for traceability.

@oliver-sanders oliver-sanders merged commit d8c355a into cylc:8.4.x Feb 25, 2025
26 of 27 checks passed
@MetRonnie MetRonnie deleted the graphql-err-handling branch February 25, 2025 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants