Address false positive non-determinism scenarios during replay #1281

taylanisikdemir · 2023-10-25T02:10:12Z

What changed?
Non-determinism checks during workflow task processing is updated/fixed to properly replay a history and catch non-determinism scenarios that were missed before.

Why?
Currently there are some known scenarios where replay/shadow tests cannot capture as a non-deterministic scenario. These false positive test results mislead users and let them deploy non-deterministic workflow changes to prod which causes workflow failures.

Fix Details
Fix includes 2 main changes:

When in replay mode, don’t terminate the workflow task processing loop until the whole history is drained in replay mode
When in replay mode, use a modified version of this filter to avoid processing final workflow complete/fail/cancel/continueasnew events because the state machine doesn’t have a corresponding replay for these events and we want the post loop non-deterministic comparison to not fail.

The fix will change the conditions in this critical workflow task loop and it will catch these scenarios not only in Replay Test mode but also actual Replays happening in prod. A new worker option is introduced to turn off new checks if needed.

How did you test it?
Added more test cases to replay_test.go and to cover positive/negative cases.
There were existing test cases covering current broken behavior (no error when there should be). Those are updated as well.

Potential risks
The changes are expected to catch more non-determinism scenarios and this might surface some existing issues in user's workflows.
New strict non-determinism checks can be disabled via DisableStrictNonDeterminismCheck worker option if users need time until problematic workflow code is fixed (via versioning).

internal/internal_task_handlers.go

test/replaytests/replay_test.go

internal/internal_task_handlers.go

…rtial decisions

… not-obvious places

taylanisikdemir · 2023-12-14T21:45:25Z

test/replaytests/replay_test.go

+// Runs a history which ends with WorkflowExecutionContinuedAsNew. Replay fails because of the additional checks done
+// for continue as new case by replayWorkflowHistory().
+// This should not have any error because it's a valid continue as new case.
+func TestContinueAsNew(t *testing.T) {


this case will be handled in a follow up PR

yea, replayer shouldn't care. shadower might, but that's quite different semantically.

separately, now that I look at this... it feels strange that it doesn't return future, err. replaying to get the result (from code) seems entirely reasonable, no need to only return error...

Groxx · 2024-01-29T21:38:55Z

internal/internal_activity.go

 	// For everything we return result.
+	// Code reaches here for 2 cases:
+	// 	1. activity is executed by name (not the func pointer) and it wasn't registered
+	// 	2. activity is executed by func pointer and the signature indicates it doesn't/can't return data.
+	//		for example it only has one return parameter (which can only be be error).
 	return decodeArg(dataConverter, result, to)
 }


2 is incorrect, a string-executed activity that doesn't return data still calls deSerializeFnResultFromFnType and returns at line 430, doing nothing. It doesn't reach this line.

Also huh. On closer look, this whole chain of code is pretty strange:

deSerializeFunctionResult is only called if result is non-nil

deSerializeFnResultFromFnType is only called by this func, and it does nothing if result is nil

none of this needs the registered func type at all:

if everything agrees, the type does nothing to contribute to the behavior

if there if there is data + an out param + the registered activity does not return data, the earlier code silently does nothing (the data exists but is inaccessible)

if it's not registered, it falls back to just decodeArg here, which is what deSerializeFnResultFromFnType does anyway

... I think deSerializeFunctionResult might be more-correctly replaced with just a call to decodeArg instead.

(this doesn't really belong in this PR IMO, it's just an odd discovery)

TBH I need to re-debug this to remember how I ended up finding those 2 cases. Maybe I meant to add this somewhere else. From the code in this function it seems 2 is not possible.

Groxx · 2024-02-07T20:40:08Z

test/replaytests/choice.json

+  {
+    "eventId": 13,
+    "timestamp": 1679427717321780255,
+    "eventType": "ActivityTaskCompleted",


why add the completed event?

though reading more of this file: I don't think this is a realistic history... "completed" implies "ended normally, i.e. by returning" but we don't create activity-task-scheduled events for activities executed during the final decision (since their result cannot be observed)...

it wasn't a realistic history because it's missing the final decision task scheduled->started->completed events. I'm not quite getting why adding the activity completion looks invalid if that's what you meant

well, there are only 2 ways workflows can end:

return normally from the code (success or error) as the result of a completed decision task

cut off externally due to timeout/termination/etc

and everything that affects ^ this occurs in transactions (no interleaving between these):

a decision-complete and all its operations go in as one chunk

an external event that the workflow can observe goes in with the event and a decision-task-schedule to process it

activity complete, child start or complete, timer fire, etc

an external event that cannot be observed just gets appended

decision task start, activity start, etc

external cutoffs write the terminal event and close the workflow (could be converted into a continue-as-new due to cron or retry)

in this history we originally had:

decision task schedule->start->complete

activity scheduled

workflow complete

a normal one, which means it returned, which means this was the result of the most-recent decision task.

^ this is possible because, unlike child workflows, we don't "prune" activity-scheduled-s that are created in the final decision task. (we should consider doing that tbh, or keeping both for easier troubleshooting. I forget exactly how child workflows do this, but I suspect the workflow-end cancels the context -> cancels the state machine -> case decisionStateCreated: d.moveState(decisionStateCompleted, eventCancel) moves it out of a send-able state)

So in this final decision task, we executed one activity, did not wait on it, and returned immediately. Code like this would result in that:

func work(ctx workflow.Context) error { // ... workflow.ExecuteActivity(ctx, "ignored") // non-blocking return nil }

As of the current PR, we have this sequence:

decision task schedule->start->complete

activity scheduled

activity started

activity complete

workflow complete (still normal)

which is not possible.

it would mean we blocked after scheduling (this is necessary to submit it to the server, which is a necessary prerequisite to the activity starting)... which is fine because workflow.ExecuteActivity(...).Get(..) would end the batch at "scheduled".

however.

after that, asynchronously, the activity was started (this can be an isolated event because it's not observable, that's fine), and then completed (this requires both complete and decision-task-scheduled because it is observable, so already it's wrong), and then also the workflow function returned on its own without a decision task to learn about that result.

The only possible outcomes that include an activity that was started are things like either:

activity started -> activity completed -> decision scheduled -> [started -> completed (both or neither)] -> workflow completed

i.e. a decision ran and decided to end the workflow

activity started -> workflow cut off externally (timeout, terminate) without a decision task

it cannot be a "completed" type (with a possible exception for a continue-as-new but that isn't in this history either)

because "activity completed -> workflow completed" would be splitting the transaction that "activity completed" is involved in, at the very least.

so it's clearly hand-manipulated rather than recorded, because the hand-manipulation produces an impossible sequence. that would be fine for a regression-like "we had a bug that produced these histories, and we want to ensure it behaves like X" tests... but it doesn't help prove much of anything about how non-corrupted workflows behave, and that's pretty much the only purpose of this folder + PR.

tbh I'm kinda surprised it doesn't panic with an illegal transition or something. But it would be an easy check to miss in the state machine too (particularly because it's more "a collection of many sub-state-machines" rather than enforcing rules in the spaces between them)

Groxx · 2024-02-07T20:45:16Z

test/replaytests/continue_as_new_wf.go

+func ContinueAsNewWorkflow(ctx workflow.Context) error {
+	selector := workflow.NewSelector(ctx)
+	var signalResult string
+	signalName := "helloWorldSignal"
+	for {
+		signalChan := workflow.GetSignalChannel(ctx, signalName)
+		selector.AddReceive(signalChan, func(c workflow.Channel, more bool) {
+			c.Receive(ctx, &signalResult)
+			workflow.GetLogger(ctx).Info("Received age signalResult from signal!", zap.String("signal", signalName), zap.String("value", signalResult))
+		})
+		workflow.GetLogger(ctx).Info("Waiting for signal on channel.. " + signalName)
+		// Wait for signal
+		selector.Select(ctx)
+		if signalResult == "kill" {
+			return nil
+		}
+	}
+}


since the only history this is running on is the continue_as_new.json, this could be just

func ContinueAsNewWorkflow(ctx workflow.Context) error { return workflow.GetSignalChannel(ctx, "unused").Receive(ctx, nil) }

and it would be exactly equivalent: it decides to do nothing, waiting on an outside event to occur. not sure what the added complexity is buying us, unless it triggers some specific edge case?

afaik, continue ask new cases fail for only those scenarios where a workflow was marked continue as new and then the history has nothing after continue as new decision.
This looks like an attempt to cover the edge case where sometimes the workflow continues as new after receiving a signal. the continue as new case has been pretty inconsistent to begin with.

I would say let's keep the original behavior here to preserve the original scenario where we first observed this bug.

there's no signal in the history though, so it can't be a "received one, did not receive a second" kind of test. anything that waits forever without recording an event should be equivalent (signal chan, regular chan, waitgroup, etc all would work)

Groxx · 2024-02-07T23:45:57Z

internal/internal_task_handlers.go

+func isDecisionEventForReplay(eventType s.EventType) bool {
+	switch eventType {
+	case
+		s.EventTypeActivityTaskScheduled,
+		s.EventTypeActivityTaskCancelRequested,
+		s.EventTypeTimerStarted,
+		s.EventTypeTimerCanceled,
+		s.EventTypeCancelTimerFailed,
+		s.EventTypeMarkerRecorded,
+		s.EventTypeStartChildWorkflowExecutionInitiated,
+		s.EventTypeRequestCancelExternalWorkflowExecutionInitiated,
+		s.EventTypeSignalExternalWorkflowExecutionInitiated,
+		s.EventTypeUpsertWorkflowSearchAttributes:
+		return true
+	default:
+		return false
+	}
+}


To stick a comment in here as a random representative location, from chats a while back, for visibility:
I generally like this approach and I think it stands a pretty good chance of being an improvement...

... but I'm not (yet) confident that it's correct, nor that it doesn't worsen other scenarios. It's a pretty substantial change despite how small it is. From vague memory of my last attempt to verify, I ended up with more concerns than I started with, but no real evidence in either direction :\

Some of ^ this is "I'm not entirely sure what the end result is during replay" and some is "it's extremely hard to verify that this does not impact normal replays / restores original behavior when disabled" (because the code here is extremely convoluted and stateful). E.g. it could be written in a more obviously-safe way than it is, which would make the latter concern go away, but perhaps this is a cleaner end result than that would be. The former is... hard though.

It may be fine even if it does introduce new flaws, it's definitely not perfect now, but we need to have an informed decision. The good news is that we now have a LOT more internal replay-shadow-tests (compared to when this was first made) that we can run to discover what has changed - if it finds nothing but improvements against our fairly wide range of behavior in those tests, that will probably be good enough.
I just definitely do not trust our test suite here to prove correctness, they don't even use like half of our features nor do they use them in complicated ways. Issues they find are plenty useful, but they're very strange in a lot of ways and that makes them very very far from evidence of correctness.

So tl;dr: it just has to be sufficiently-validated or fully understood/described.
We've done some validation (thanks @agautam478 !) but we need to do a bit more, and none of the tests here seem like particularly strong evidence for what changed so I need to dig back in and try to reason about it more carefully.

All sounds good to me. At the time I looked into these false positives I focused on a subset of the potential scenarios. I'm not super confident this is completely safe change even though it looks like it is. Finding more scenarios and adding them as test cases should be next step as discussed offline. Due to other priorities I'm handing this over to you and @agautam478.

taylanisikdemir commented Oct 25, 2023

View reviewed changes

internal/internal_task_handlers.go Show resolved Hide resolved

agautam478 reviewed Oct 25, 2023

View reviewed changes

test/replaytests/replay_test.go Outdated Show resolved Hide resolved

internal/internal_task_handlers.go Show resolved Hide resolved

Groxx reviewed Oct 27, 2023

View reviewed changes

internal/internal_task_handlers.go Outdated Show resolved Hide resolved

taylanisikdemir mentioned this pull request Nov 3, 2023

Introduce DisableStrictNonDeterminismCheck worker option #1288

Merged

taylanisikdemir force-pushed the taylan/replayer branch from 25c0a09 to a9f22e3 Compare November 7, 2023 23:51

taylanisikdemir changed the title ~~WIP: Address false posivite non-determinism errors during replay~~ Address false positive non-determinism scenarios during replay Nov 8, 2023

taylanisikdemir marked this pull request as ready for review November 8, 2023 00:04

taylanisikdemir mentioned this pull request Nov 9, 2023

Release v1.1.0 #1290

Merged

taylanisikdemir added 8 commits November 16, 2023 13:29

replay test sequential

ec3ecd5

Address false positives in replay mode

f85fe93

fix

8647c6c

bring back lastDecisionEventsForReplayTest because it's needed for pa…

b5ec4d9

…rtial decisions

fix and improve test coverage of replays, add explanation comments to…

223fb5d

… not-obvious places

minor comment clarification

bff285f

use strict nondeterminism option to enable/disable new checks

b26326f

Add replay test case for problematic continue-as-new case

ff4f4d1

taylanisikdemir force-pushed the taylan/replayer branch from c9c3c25 to ff4f4d1 Compare November 16, 2023 21:32

Merge branch 'master' into taylan/replayer

51dbd59

agautam478 requested review from shijiesheng, jakobht, 3vilhamster, dkrotx and demirkayaender as code owners November 29, 2023 21:18

taylanisikdemir commented Dec 14, 2023

View reviewed changes

Merge branch 'master' into taylan/replayer

576e118

Groxx reviewed Jan 29, 2024

View reviewed changes

Merge branch 'master' into taylan/replayer

45560c4

Groxx reviewed Feb 7, 2024

View reviewed changes

agautam478 mentioned this pull request Feb 26, 2024

Tests for Continue as New and one additional scenario #1313

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Address false positive non-determinism scenarios during replay #1281

Address false positive non-determinism scenarios during replay #1281

taylanisikdemir commented Oct 25, 2023 •

edited

Loading

taylanisikdemir Dec 14, 2023

Groxx Jan 29, 2024 •

edited

Loading

Groxx Jan 29, 2024 •

edited

Loading

taylanisikdemir Feb 8, 2024

Groxx Feb 7, 2024 •

edited

Loading

taylanisikdemir Feb 8, 2024

Groxx Feb 8, 2024 •

edited

Loading

Groxx Feb 8, 2024 •

edited

Loading

Groxx Feb 7, 2024 •

edited

Loading

agautam478 Feb 7, 2024

Groxx Feb 7, 2024

Groxx Feb 7, 2024 •

edited

Loading

taylanisikdemir Feb 8, 2024

Address false positive non-determinism scenarios during replay #1281

Are you sure you want to change the base?

Address false positive non-determinism scenarios during replay #1281

Conversation

taylanisikdemir commented Oct 25, 2023 • edited Loading

taylanisikdemir Dec 14, 2023

Choose a reason for hiding this comment

Groxx Jan 29, 2024 • edited Loading

Choose a reason for hiding this comment

Groxx Jan 29, 2024 • edited Loading

Choose a reason for hiding this comment

taylanisikdemir Feb 8, 2024

Choose a reason for hiding this comment

Groxx Feb 7, 2024 • edited Loading

Choose a reason for hiding this comment

taylanisikdemir Feb 8, 2024

Choose a reason for hiding this comment

Groxx Feb 8, 2024 • edited Loading

Choose a reason for hiding this comment

Groxx Feb 8, 2024 • edited Loading

Choose a reason for hiding this comment

Groxx Feb 7, 2024 • edited Loading

Choose a reason for hiding this comment

agautam478 Feb 7, 2024

Choose a reason for hiding this comment

Groxx Feb 7, 2024

Choose a reason for hiding this comment

Groxx Feb 7, 2024 • edited Loading

Choose a reason for hiding this comment

taylanisikdemir Feb 8, 2024

Choose a reason for hiding this comment

taylanisikdemir commented Oct 25, 2023 •

edited

Loading

Groxx Jan 29, 2024 •

edited

Loading

Groxx Jan 29, 2024 •

edited

Loading

Groxx Feb 7, 2024 •

edited

Loading

Groxx Feb 8, 2024 •

edited

Loading

Groxx Feb 8, 2024 •

edited

Loading

Groxx Feb 7, 2024 •

edited

Loading

Groxx Feb 7, 2024 •

edited

Loading