ResourceConfigCache Staleness causes same workflow to be deleted multiple times. #2957

Open
himanshukandwal opened this issue Nov 3, 2024 · 0 comments · May be fixed by #2958
Labels
bug Something isn't working

Comments

Contributor

himanshukandwal commented Nov 3, 2024

Describe the bug

When a workflow is marked for deletion (TargetState.DELETE) via the TaskDriver.delete API, the Helix controller pipeline deletes it on its next run. The deletion removes the workflow's entries from ZK and updates the TaskDataCache. However, the entries for this workflow remain in the ResourceConfig (base) cache.

Usually, the next event in the pipeline is the ResourceConfigChange event (raised because the resource config was deleted), which updates the ResourceConfig cache. In a very busy cluster, however, other change events can already be queued ahead of that ResourceConfigChange event. Because the ResourceConfig cache is refreshed selectively rather than on every event, it keeps the deleted workflow's entries, and when the workflow data (TaskDataCache) is prepared, these previously deleted entries come back. As a result, the same workflow is deleted multiple times, until a ResourceConfigChange or OnDemandRebalance event is processed.
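
The sequence described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `StaleCacheSketch`, `Event`, and `countDeletes` are hypothetical names, not Helix classes, and the "cache" is just a set that is refreshed from the "ZK" set only when a ResourceConfigChange event is processed.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these names are Helix classes or APIs.
class StaleCacheSketch {
    enum Event { OTHER_CHANGE, RESOURCE_CONFIG_CHANGE }

    /** Returns how many times "wf1" is deleted while the events are processed in order. */
    static int countDeletes(List<Event> events) {
        // Stand-in for ZK: workflows that currently exist. "wf1" is marked DELETE.
        Set<String> zk = new HashSet<>(Set.of("wf1"));
        // Stand-in for the ResourceConfig cache, initially in sync with ZK.
        Set<String> resourceConfigCache = new HashSet<>(zk);
        int deletes = 0;

        for (Event event : events) {
            // Selective refresh: only a ResourceConfigChange event re-reads ZK.
            if (event == Event.RESOURCE_CONFIG_CHANGE) {
                resourceConfigCache = new HashSet<>(zk);
            }
            // The workflow view is rebuilt from the (possibly stale) cache, and
            // every workflow found there is assumed to be marked DELETE.
            for (String workflow : resourceConfigCache) {
                zk.remove(workflow); // the delete removes the ZK entry...
                deletes++;           // ...but the cache entry is left in place
            }
        }
        return deletes;
    }

    public static void main(String[] args) {
        List<Event> busy = List.of(Event.OTHER_CHANGE, Event.OTHER_CHANGE,
                Event.OTHER_CHANGE, Event.RESOURCE_CONFIG_CHANGE, Event.OTHER_CHANGE);
        System.out.println("deletes on busy cluster: " + countDeletes(busy));
    }
}
```

In this model, every event processed before the ResourceConfigChange event issues another delete for the same workflow name, which matches the duplicate deletes seen in the logs.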

TL;DR: In a busy cluster, the ResourceConfig cache can take time to become consistent, and this causes duplicate deletes of the same workflow.

Impact: Some customers delete a workflow and re-create it with the same name. With this behavior, the workflow that was just deleted and re-created is unexpectedly deleted again.

To Reproduce

This happens in a busy cluster where many events arrive non-deterministically, so the behavior cannot be reproduced reliably. It is, however, evident in the cluster logs.

Expected behavior

Deletion should happen only once, and the ResourceConfig cache should be updated immediately so that subsequent events are served reliably and deterministically.
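
One way to picture this expectation is a cache that evicts its own entry in the same step that deletes the workflow. The sketch below is an assumption for illustration, not the change proposed in the linked pull request: `EagerEvictionSketch`, `ConfigCache`, `deleteWorkflow`, and `runPipelineOnce` are hypothetical names, and a plain map stands in for ZK.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these names are Helix classes or APIs.
class EagerEvictionSketch {
    /** Hypothetical cache that mirrors a backing store (workflow name -> target state). */
    static class ConfigCache {
        private final Map<String, String> store;                   // stand-in for ZK
        private final Map<String, String> cached = new HashMap<>();
        int deleteCalls = 0;

        ConfigCache(Map<String, String> store) {
            this.store = store;
            this.cached.putAll(store);
        }

        /** Deletes the workflow and evicts its cache entry in the same step. */
        void deleteWorkflow(String name) {
            store.remove(name);
            cached.remove(name); // no waiting for a later change event
            deleteCalls++;
        }

        /** One pipeline run that never refreshes the cache from the store. */
        void runPipelineOnce() {
            // Iterate over a copy because deleteWorkflow mutates the cached map.
            for (String name : new ArrayList<>(cached.keySet())) {
                if ("DELETE".equals(cached.get(name))) {
                    deleteWorkflow(name);
                }
            }
        }
    }

    /** Runs the pipeline the given number of times and returns the delete count. */
    static int deletesAfterRuns(int runs) {
        Map<String, String> store = new HashMap<>(Map.of("wf1", "DELETE"));
        ConfigCache cache = new ConfigCache(store);
        for (int i = 0; i < runs; i++) {
            cache.runPipelineOnce();
        }
        return cache.deleteCalls;
    }

    public static void main(String[] args) {
        System.out.println("deletes after 5 runs: " + deletesAfterRuns(5));
    }
}
```

Because the entry leaves the cache together with the store entry, later runs no longer see the deleted workflow, regardless of which event triggers them.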


@himanshukandwal himanshukandwal added the bug Something isn't working label Nov 3, 2024
@himanshukandwal himanshukandwal changed the title Cache Staleness causes same workflow to be deleted multiple times. ResourceConfigCache Staleness causes same workflow to be deleted multiple times. Nov 3, 2024