Describe the bug

When a Workflow is marked as deleted (TargetState.DELETE) via the TaskDriver.delete API, the workflow is deleted by the Helix Controller pipeline in its next run. As part of that deletion, the entries are removed from ZK and the TaskDataCache is updated. However, the entries for this workflow are still present in the ResourceConfig (base) cache.

Usually the next event in the pipeline is the ResourceConfigChange event (since the resource config was deleted), which takes care of refreshing the ResourceConfig cache. In a very busy cluster, however, other change events can already be queued ahead of the scheduled ResourceConfigChange event. Because the ResourceConfig cache is refreshed selectively rather than on every run, it keeps the deleted workflow's entries, and when the workflow state (TaskDataCache) is prepared, those previously deleted workflow entries come back in. This causes the same workflow to be deleted multiple times, until a ResourceConfigChange or OnDemandRebalance event is processed.
TL;DR: In a busy cluster, the ResourceConfig cache can take time to become eventually consistent, and this causes duplicate deletes of the same workflow.

Impact: Some customers delete and re-create a workflow with the same name, and this behavior causes the freshly re-created workflow to be deleted again unexpectedly; a minimal sketch of that pattern follows.
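For context, the delete-and-re-create pattern that triggers this looks roughly like the sketch below (the workflow name, job name, and surrounding setup are hypothetical; error handling is omitted):

```java
import org.apache.helix.HelixManager;
import org.apache.helix.task.JobConfig;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.Workflow;

public class DeleteRecreateSketch {
  public static void run(HelixManager manager, JobConfig.Builder jobBuilder) {
    TaskDriver driver = new TaskDriver(manager);

    // Marks the workflow's target state as DELETE; the controller pipeline
    // removes the ZK entries and updates the TaskDataCache on its next run.
    driver.delete("analyticsWorkflow");

    // Re-create a workflow with the same name shortly afterwards. If the
    // ResourceConfig cache still holds the stale entries, the repeated delete
    // described above can unexpectedly sweep up this new workflow too.
    Workflow workflow = new Workflow.Builder("analyticsWorkflow")
        .addJob("cleanupJob", jobBuilder)
        .build();
    driver.start(workflow);
  }
}
```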
To Reproduce
Since this happens in a busy cluster where many events are firing non-deterministically, it is not practical to reproduce this behavior in a test; it is, however, evident in the cluster logs.
Expected behavior
Deletion should happen only once, and the ResourceConfig cache should be updated immediately so that subsequent events are served in a reliable and deterministic manner (see the illustrative sketch below).
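Purely as an illustration of that expectation (the class and field names below are invented and do not correspond to actual Helix internals), the idea is to evict the workflow's entry from the resource-config cache at the moment the deletion is applied, instead of waiting for the next ResourceConfigChange event:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the controller-side caches, for illustration only.
class ResourceConfigCacheSketch {
  private final Map<String, Object> resourceConfigCache = new ConcurrentHashMap<>();
  private final Set<String> deletedWorkflows = ConcurrentHashMap.newKeySet();

  // Expected behavior: when a workflow is deleted, drop its resource-config
  // entry immediately so a later pipeline run cannot resurrect it.
  void onWorkflowDeleted(String workflowName) {
    deletedWorkflows.add(workflowName);
    resourceConfigCache.remove(workflowName); // eager eviction, no wait for ResourceConfigChange
  }

  // The reported bug, for contrast: the TaskDataCache is rebuilt from a still-stale
  // resource-config cache, so a deleted workflow reappears and is deleted again.
  boolean wouldResurrect(String workflowName) {
    return deletedWorkflows.contains(workflowName)
        && resourceConfigCache.containsKey(workflowName);
  }
}
```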