Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address false positive non-determinism scenarios during replay #1281

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

taylanisikdemir
Copy link
Member

@taylanisikdemir taylanisikdemir commented Oct 25, 2023

What changed?
Non-determinism checks during workflow task processing is updated/fixed to properly replay a history and catch non-determinism scenarios that were missed before.

Why?
Currently there are some known scenarios where replay/shadow tests cannot capture as a non-deterministic scenario. These false positive test results mislead users and let them deploy non-deterministic workflow changes to prod which causes workflow failures.

Fix Details
Fix includes 2 main changes:

  1. When in replay mode, don’t terminate the workflow task processing loop until the whole history is drained in replay mode
  2. When in replay mode, use a modified version of this filter to avoid processing final workflow complete/fail/cancel/continueasnew events because the state machine doesn’t have a corresponding replay for these events and we want the post loop non-deterministic comparison to not fail.

The fix will change the conditions in this critical workflow task loop and it will catch these scenarios not only in Replay Test mode but also actual Replays happening in prod. A new worker option is introduced to turn off new checks if needed.

How did you test it?
Added more test cases to replay_test.go and to cover positive/negative cases.
There were existing test cases covering current broken behavior (no error when there should be). Those are updated as well.

Potential risks
The changes are expected to catch more non-determinism scenarios and this might surface some existing issues in user's workflows.
New strict non-determinism checks can be disabled via DisableStrictNonDeterminismCheck worker option if users need time until problematic workflow code is fixed (via versioning).

test/replaytests/replay_test.go Outdated Show resolved Hide resolved
internal/internal_task_handlers.go Show resolved Hide resolved
@taylanisikdemir taylanisikdemir changed the title WIP: Address false posivite non-determinism errors during replay Address false positive non-determinism scenarios during replay Nov 8, 2023
@taylanisikdemir taylanisikdemir marked this pull request as ready for review November 8, 2023 00:04
@taylanisikdemir taylanisikdemir mentioned this pull request Nov 9, 2023
// Runs a history which ends with WorkflowExecutionContinuedAsNew. Replay fails because of the additional checks done
// for continue as new case by replayWorkflowHistory().
// This should not have any error because it's a valid continue as new case.
func TestContinueAsNew(t *testing.T) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this case will be handled in a follow up PR

Copy link
Member

@Groxx Groxx Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, replayer shouldn't care. shadower might, but that's quite different semantically.

separately, now that I look at this... it feels strange that it doesn't return future, err. replaying to get the result (from code) seems entirely reasonable, no need to only return error...

Comment on lines 434 to 440
// For everything we return result.
// Code reaches here for 2 cases:
// 1. activity is executed by name (not the func pointer) and it wasn't registered
// 2. activity is executed by func pointer and the signature indicates it doesn't/can't return data.
// for example it only has one return parameter (which can only be be error).
return decodeArg(dataConverter, result, to)
}
Copy link
Member

@Groxx Groxx Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 is incorrect, a string-executed activity that doesn't return data still calls deSerializeFnResultFromFnType and returns at line 430, doing nothing. It doesn't reach this line.


Also huh. On closer look, this whole chain of code is pretty strange:

  • deSerializeFunctionResult is only called if result is non-nil
  • deSerializeFnResultFromFnType is only called by this func, and it does nothing if result is nil
  • none of this needs the registered func type at all:
    • if everything agrees, the type does nothing to contribute to the behavior
    • if there if there is data + an out param + the registered activity does not return data, the earlier code silently does nothing (the data exists but is inaccessible)
    • if it's not registered, it falls back to just decodeArg here, which is what deSerializeFnResultFromFnType does anyway

... I think deSerializeFunctionResult might be more-correctly replaced with just a call to decodeArg instead.

(this doesn't really belong in this PR IMO, it's just an odd discovery)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I need to re-debug this to remember how I ended up finding those 2 cases. Maybe I meant to add this somewhere else. From the code in this function it seems 2 is not possible.

{
"eventId": 13,
"timestamp": 1679427717321780255,
"eventType": "ActivityTaskCompleted",
Copy link
Member

@Groxx Groxx Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add the completed event?

though reading more of this file: I don't think this is a realistic history... "completed" implies "ended normally, i.e. by returning" but we don't create activity-task-scheduled events for activities executed during the final decision (since their result cannot be observed)...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it wasn't a realistic history because it's missing the final decision task scheduled->started->completed events. I'm not quite getting why adding the activity completion looks invalid if that's what you meant

Copy link
Member

@Groxx Groxx Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, there are only 2 ways workflows can end:

  • return normally from the code (success or error) as the result of a completed decision task
  • cut off externally due to timeout/termination/etc

and everything that affects ^ this occurs in transactions (no interleaving between these):

  • a decision-complete and all its operations go in as one chunk
  • an external event that the workflow can observe goes in with the event and a decision-task-schedule to process it
    • activity complete, child start or complete, timer fire, etc
  • an external event that cannot be observed just gets appended
    • decision task start, activity start, etc
  • external cutoffs write the terminal event and close the workflow (could be converted into a continue-as-new due to cron or retry)

in this history we originally had:

  • decision task schedule->start->complete
  • activity scheduled
  • workflow complete
    • a normal one, which means it returned, which means this was the result of the most-recent decision task.

^ this is possible because, unlike child workflows, we don't "prune" activity-scheduled-s that are created in the final decision task. (we should consider doing that tbh, or keeping both for easier troubleshooting. I forget exactly how child workflows do this, but I suspect the workflow-end cancels the context -> cancels the state machine -> case decisionStateCreated: d.moveState(decisionStateCompleted, eventCancel) moves it out of a send-able state)

So in this final decision task, we executed one activity, did not wait on it, and returned immediately. Code like this would result in that:

func work(ctx workflow.Context) error {
  // ...
  workflow.ExecuteActivity(ctx, "ignored") // non-blocking
  return nil
}

As of the current PR, we have this sequence:

  • decision task schedule->start->complete
  • activity scheduled
  • activity started
  • activity complete
  • workflow complete (still normal)

which is not possible.

it would mean we blocked after scheduling (this is necessary to submit it to the server, which is a necessary prerequisite to the activity starting)... which is fine because workflow.ExecuteActivity(...).Get(..) would end the batch at "scheduled".

however.

after that, asynchronously, the activity was started (this can be an isolated event because it's not observable, that's fine), and then completed (this requires both complete and decision-task-scheduled because it is observable, so already it's wrong), and then also the workflow function returned on its own without a decision task to learn about that result.

The only possible outcomes that include an activity that was started are things like either:

  • activity started -> activity completed -> decision scheduled -> [started -> completed (both or neither)] -> workflow completed
    • i.e. a decision ran and decided to end the workflow
  • activity started -> workflow cut off externally (timeout, terminate) without a decision task
    • it cannot be a "completed" type (with a possible exception for a continue-as-new but that isn't in this history either)

because "activity completed -> workflow completed" would be splitting the transaction that "activity completed" is involved in, at the very least.


so it's clearly hand-manipulated rather than recorded, because the hand-manipulation produces an impossible sequence. that would be fine for a regression-like "we had a bug that produced these histories, and we want to ensure it behaves like X" tests... but it doesn't help prove much of anything about how non-corrupted workflows behave, and that's pretty much the only purpose of this folder + PR.

Copy link
Member

@Groxx Groxx Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh I'm kinda surprised it doesn't panic with an illegal transition or something. But it would be an easy check to miss in the state machine too (particularly because it's more "a collection of many sub-state-machines" rather than enforcing rules in the spaces between them)

Comment on lines +9 to +26
func ContinueAsNewWorkflow(ctx workflow.Context) error {
selector := workflow.NewSelector(ctx)
var signalResult string
signalName := "helloWorldSignal"
for {
signalChan := workflow.GetSignalChannel(ctx, signalName)
selector.AddReceive(signalChan, func(c workflow.Channel, more bool) {
c.Receive(ctx, &signalResult)
workflow.GetLogger(ctx).Info("Received age signalResult from signal!", zap.String("signal", signalName), zap.String("value", signalResult))
})
workflow.GetLogger(ctx).Info("Waiting for signal on channel.. " + signalName)
// Wait for signal
selector.Select(ctx)
if signalResult == "kill" {
return nil
}
}
}
Copy link
Member

@Groxx Groxx Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the only history this is running on is the continue_as_new.json, this could be just

func ContinueAsNewWorkflow(ctx workflow.Context) error {
	return workflow.GetSignalChannel(ctx, "unused").Receive(ctx, nil)
}

and it would be exactly equivalent: it decides to do nothing, waiting on an outside event to occur. not sure what the added complexity is buying us, unless it triggers some specific edge case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik, continue ask new cases fail for only those scenarios where a workflow was marked continue as new and then the history has nothing after continue as new decision.
This looks like an attempt to cover the edge case where sometimes the workflow continues as new after receiving a signal. the continue as new case has been pretty inconsistent to begin with.

I would say let's keep the original behavior here to preserve the original scenario where we first observed this bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no signal in the history though, so it can't be a "received one, did not receive a second" kind of test. anything that waits forever without recording an event should be equivalent (signal chan, regular chan, waitgroup, etc all would work)

Comment on lines +268 to +285
func isDecisionEventForReplay(eventType s.EventType) bool {
switch eventType {
case
s.EventTypeActivityTaskScheduled,
s.EventTypeActivityTaskCancelRequested,
s.EventTypeTimerStarted,
s.EventTypeTimerCanceled,
s.EventTypeCancelTimerFailed,
s.EventTypeMarkerRecorded,
s.EventTypeStartChildWorkflowExecutionInitiated,
s.EventTypeRequestCancelExternalWorkflowExecutionInitiated,
s.EventTypeSignalExternalWorkflowExecutionInitiated,
s.EventTypeUpsertWorkflowSearchAttributes:
return true
default:
return false
}
}
Copy link
Member

@Groxx Groxx Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To stick a comment in here as a random representative location, from chats a while back, for visibility:
I generally like this approach and I think it stands a pretty good chance of being an improvement...

... but I'm not (yet) confident that it's correct, nor that it doesn't worsen other scenarios. It's a pretty substantial change despite how small it is. From vague memory of my last attempt to verify, I ended up with more concerns than I started with, but no real evidence in either direction :\

Some of ^ this is "I'm not entirely sure what the end result is during replay" and some is "it's extremely hard to verify that this does not impact normal replays / restores original behavior when disabled" (because the code here is extremely convoluted and stateful). E.g. it could be written in a more obviously-safe way than it is, which would make the latter concern go away, but perhaps this is a cleaner end result than that would be. The former is... hard though.

It may be fine even if it does introduce new flaws, it's definitely not perfect now, but we need to have an informed decision. The good news is that we now have a LOT more internal replay-shadow-tests (compared to when this was first made) that we can run to discover what has changed - if it finds nothing but improvements against our fairly wide range of behavior in those tests, that will probably be good enough.
I just definitely do not trust our test suite here to prove correctness, they don't even use like half of our features nor do they use them in complicated ways. Issues they find are plenty useful, but they're very strange in a lot of ways and that makes them very very far from evidence of correctness.


So tl;dr: it just has to be sufficiently-validated or fully understood/described.
We've done some validation (thanks @agautam478 !) but we need to do a bit more, and none of the tests here seem like particularly strong evidence for what changed so I need to dig back in and try to reason about it more carefully.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All sounds good to me. At the time I looked into these false positives I focused on a subset of the potential scenarios. I'm not super confident this is completely safe change even though it looks like it is. Finding more scenarios and adding them as test cases should be next step as discussed offline. Due to other priorities I'm handing this over to you and @agautam478.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants