Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Recent reports of activity completion timing #807

Closed
cretz opened this issue Aug 30, 2024 · 4 comments · Fixed by #811
Closed

[Bug] Recent reports of activity completion timing #807

cretz opened this issue Aug 30, 2024 · 4 comments · Fixed by #811
Labels
bug Something isn't working

Comments

@cretz
Copy link
Member

cretz commented Aug 30, 2024

Describe the bug

In latest .NET (1.3.0) and Python SDKs (1.7.0), users are reporting seeing things like:

2024-08-30T15:15:19.610801Z WARN temporal_sdk_core::worker::activities: Network error while completing activity error=Status { code: Cancelled, message: "operation was canceled", source: Some(tonic::transport::Error(Transport, hyper::Error(Canceled, "connection closed"))) }

Or

2024-08-23T18:12:54.800672Z WARN temporal_sdk_core::worker::activities::activity_heartbeat_manager: Error when recording heartbeat: Status { code: Cancelled, message: "operation was canceled", source: Some(tonic::transport::Error(Transport, hyper::Error(Canceled, "connection closed"))) }

This happens intermittently (i.e. not all the time) and has been confirmed not to happen when the users downgrade (.NET 1.2.0 and Python 1.6.0). There are not much more details to go on. It is technically possible these are unrelated and lang-specific things. I have no replication from any user.

My initial suspicion was #780 (I saw something at #781 that fixed something but unfamiliar), but I see nothing obvious there. Maybe a Tonic upgrade at #782 messed things up here. I think the first thing is to confirm what the timeout is set to for these activity completion calls. I wonder if we can simulate a tad slow respond activity complete or activity heartbeat server side and see if it replicates.

@tetrakatech
Copy link

In case it's helpful for future debugging (either by project maintainers, or by anyone else), we ran into the same error in production using version 1.11.1 of the Typescript SDK, and it seems to have been resolved by upgrading to 1.11.3 (which includes #811).

Our log lines:

{
  "jsonPayload": {
    "metadata": {
      "error": "Status { code: Cancelled, message: \"operation was canceled\", source: Some(tonic::transport::Error(Transport, hyper::Error(Canceled, \"connection closed\"))) }",
      "sdkComponent": "core",
      "label": "worker"
    },
    "message": "Network error while completing activity"
  },
  "timestamp": "2024-11-06T03:29:36.765000104Z",
  "severity": "WARNING",
}

None of us have seen the error when developing locally, but we saw it fairly commonly in our prod environment (k8s running inside GCP, each pod running both Temporal workers and an instance of the Temporal frontend in their own containers - so the network calls from worker to frontend are on localhost).

We saw the error ~180 times in the last week, on ~60k activity executions, so this was happening on say O(0.1%) of activity completions, and caused spurious timeouts of the activities despite successful completion.

They errors don't seem to correlate with load beyond what you'd expect for the overall error rate. Here's a graph of the error occurrence over the last week:
image

Against overall activity completion rates (successful or errored):
image

We see no abnormal logs from the Temporal frontend when the errors occur, and we sometimes see the error occurring even on activities that complete fairly quickly (< 10 seconds).

An example activity execution, and relevant logs (simplified):

2024-11-06 10:51:09.47 Temporal UI reports ActivityTaskScheduled
2024-11-06 10:51:09.49 Temporal UI reports ActivityTaskStarted
2024-11-06 10:51:09.509 [application] Received activity <activityName>
2024-11-06 10:51:09.510 [worker] Activity started
...more application logs...
2024-11-06 10:51:13.539 [application] Completed activity <activityName> (application-measured duration: 4030ms)
2024-11-06 10:51:13.539 [worker] Activity completed (worker-measured duration: 4029ms)
2024-11-06 10:51:13.542 [worker] WARN Network error while completing activity, Status { code: Cancelled, message: "operation was canceled", source: Some(tonic::transport::Error(Transport, hyper::Error(Canceled, "connection closed"))) }
2024-11-06 10:54:09.49 Temporal UI reports ActivityTaskTimedOut

There are no Temporal frontend logs during that window or within a minute of it, and the logs on both sides of the event relate to other workflows encountering timeout errors as well.

If there's other information that would be helpful to provide, let me know.

@Sushisource
Copy link
Member

@tetrakatech Thanks for the follow up! Glad to hear the fix is working for you.

@pauldotknopf
Copy link

pauldotknopf commented Jan 16, 2025

I'm experiencing this same thing with version 1.3.1 of the .NET NuGet package/sdk.

[
  {
    "severityLevel": "Error",
    "outerId": "0",
    "message": "operation was canceled",
    "type": "Temporalio.Exceptions.RpcException",
    "id": "35969447",
    "parsedStack": [
      {
        "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
        "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
        "level": 0
      },
      {
        "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
        "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
        "level": 1
      },
      {
        "assembly": "Temporalio, Version=1.3.1.0, Culture=neutral, PublicKeyToken=null",
        "method": "Temporalio.Bridge.Client+<CallAsync>d__14`1.MoveNext",
        "level": 2
      },
      {
        "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
        "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
        "level": 3
      },
      {
        "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
        "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
        "level": 4
      },
      {
        "assembly": "Temporalio, Version=1.3.1.0, Culture=neutral, PublicKeyToken=null",
        "method": "Temporalio.Client.TemporalConnection+<InvokeRpcAsync>d__42`1.MoveNext",
        "level": 5
      },
      {
        "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
        "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
        "level": 6
      },
      {
        "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
        "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
        "level": 7
      },
      {
        "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
        "method": "System.Runtime.CompilerServices.TaskAwaiter`1.GetResult",
        "level": 8
      },
      {
        "assembly": "Private.Services, Version=0.2.73.0, Culture=neutral, PublicKeyToken=null",
        "method": "Private.Services.HealthChecks.TemporalConnectionHealthCheck+<CheckHealthAsync>d__2.MoveNext",
        "level": 9,
        "line": 20,
        "fileName": "/agent/_work/1/s/src/Private/src/Private.Services/HealthChecks/TemporalConnectionHealthCheck.cs"
      }
    ]
  }
]

@cretz
Copy link
Member Author

cretz commented Jan 16, 2025

@pauldotknopf - This may not be the same issue, it could be any number of things. Can you reliably replicate? Can you provide a small standalone replication? Can you replicate with the latest .NET SDK? If so, can you open an issue on the https://github.com/temporalio/sdk-dotnet repo with the replication so we can debug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants