ResourceConfigCache Staleness causes same workflow to be deleted multiple times. #2957

himanshukandwal · 2024-11-03T23:02:23Z

Describe the bug

When a workflow is marked for deletion (TargetState.DELETE) via the TaskDriver.delete API, the Helix controller pipeline deletes it on its next run. The deletion removes the workflow's entries from ZK and also updates the TaskDataCache. However, the other entries for this workflow are still present in the ResourceConfig (base) cache.

Usually, the next event in the pipeline is the ResourceConfigChange event (because the resource config was deleted), which takes care of updating the ResourceConfig cache. In a very busy cluster, however, other change events can already be ahead of the scheduled ResourceConfigChange event in the pipeline. The ResourceConfig cache is refreshed selectively rather than on every event, so it keeps the deleted workflow's entries, and when the workflow data (TaskDataCache) is prepared, these previously deleted workflow entries come back in. As a result, the same workflow is deleted multiple times, until a ResourceConfigChange or OnDemandRebalance event is processed.

TL;DR: In a busy cluster, the ResourceConfig cache can take time to become eventually consistent, and this causes duplicate deletes of the same workflow.

Impact: Some customers delete and re-create a workflow with the same name, and this behavior causes the recently deleted (and since re-created) workflow to be deleted again unexpectedly.
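The sequence described above can be modelled with a small, self-contained sketch. It is a toy model, not Helix code: `StaleCacheDemo`, `store`, `cache`, and `runPipeline` are hypothetical names, the `store` map stands in for ZK, and the `refreshCache` flag stands in for whether the triggering event reloads the ResourceConfig cache. It also assumes that the stale cache entry still carries the DELETE target state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names are Helix classes or APIs.
class StaleCacheDemo {
    enum TargetState { START, DELETE }

    // Stand-in for ZooKeeper, the source of truth.
    final Map<String, TargetState> store = new HashMap<>();
    // Stand-in for the ResourceConfig cache, refreshed only on some events.
    final Map<String, TargetState> cache = new HashMap<>();
    // Every delete the pipeline issues, in order.
    final List<String> deletesIssued = new ArrayList<>();

    // One pipeline run. refreshCache models whether the triggering event
    // reloads the ResourceConfig cache from the store.
    void runPipeline(boolean refreshCache) {
        if (refreshCache) {
            cache.clear();
            cache.putAll(store);
        }
        for (Map.Entry<String, TargetState> e : cache.entrySet()) {
            if (e.getValue() == TargetState.DELETE) {
                store.remove(e.getKey());      // the cache entry is left behind
                deletesIssued.add(e.getKey());
            }
        }
    }

    static StaleCacheDemo runScenario() {
        StaleCacheDemo d = new StaleCacheDemo();
        d.store.put("wf", TargetState.START);
        d.runPipeline(true);                    // workflow is running
        d.store.put("wf", TargetState.DELETE);  // client marks it for deletion
        d.runPipeline(true);                    // first (intended) delete
        d.store.put("wf", TargetState.START);   // client re-creates the same name
        d.runPipeline(false);                   // unrelated event, cache not refreshed
        return d;
    }

    public static void main(String[] args) {
        StaleCacheDemo d = runScenario();
        System.out.println("deletes issued: " + d.deletesIssued);
        System.out.println("re-created workflow survives: " + d.store.containsKey("wf"));
    }
}
```

In this model the third pipeline run acts on the stale DELETE entry, so a second delete is issued and the re-created workflow is removed from the store.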

To Reproduce

This happens in a busy cluster where many events occur non-deterministically, so it is not possible to reproduce the behavior reliably. It is, however, evident in the cluster logs.

Expected behavior

Deletion should happen only once, and the ResourceConfig cache should be updated immediately so that subsequent events are served reliably and deterministically.
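One way to get this behavior is to request a cache refresh as part of the workflow cleanup, so that the next pipeline run reloads the cache regardless of which event triggers it. The sketch below is a toy model under that assumption, not the actual Helix change: `RefreshOnDeleteDemo`, `refreshRequested`, and `onResourceConfigChange` are hypothetical names, and the `store` map stands in for ZK.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names are Helix classes or APIs.
class RefreshOnDeleteDemo {
    enum TargetState { START, DELETE }

    final Map<String, TargetState> store = new HashMap<>();  // stand-in for ZK
    final Map<String, TargetState> cache = new HashMap<>();  // stand-in for the ResourceConfig cache
    final List<String> deletesIssued = new ArrayList<>();
    boolean refreshRequested = true;  // reload the cache on the first run

    // Models a ResourceConfigChange event reaching the controller.
    void onResourceConfigChange() {
        refreshRequested = true;
    }

    void runPipeline() {
        if (refreshRequested) {
            cache.clear();
            cache.putAll(store);
            refreshRequested = false;
        }
        for (Map.Entry<String, TargetState> e : cache.entrySet()) {
            if (e.getValue() == TargetState.DELETE) {
                store.remove(e.getKey());
                deletesIssued.add(e.getKey());
                // Assumed fix: cleanup itself requests a refresh, so the
                // next run does not depend on event ordering.
                refreshRequested = true;
            }
        }
    }

    static RefreshOnDeleteDemo runScenario() {
        RefreshOnDeleteDemo d = new RefreshOnDeleteDemo();
        d.store.put("wf", TargetState.START);
        d.runPipeline();                        // workflow is running
        d.store.put("wf", TargetState.DELETE);  // client marks it for deletion
        d.onResourceConfigChange();
        d.runPipeline();                        // the only delete
        d.store.put("wf", TargetState.START);   // client re-creates the same name
        d.runPipeline();                        // unrelated event, but cache reloads
        return d;
    }

    public static void main(String[] args) {
        RefreshOnDeleteDemo d = runScenario();
        System.out.println("deletes issued: " + d.deletesIssued);
        System.out.println("re-created workflow survives: " + d.store.containsKey("wf"));
    }
}
```

In this model the last run reloads the cache before evaluating target states, so only one delete is issued and the re-created workflow stays in the store.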


himanshukandwal added the bug label Nov 3, 2024

himanshukandwal changed the title ~~Cache Staleness causes same workflow to be deleted multiple times.~~ ResourceConfigCache Staleness causes same workflow to be deleted multiple times. Nov 3, 2024

himanshukandwal linked a pull request Nov 3, 2024 that will close this issue

[apache/helix] -- Added cache refresh trigger after cleaning up of a workflow. #2958

Open

4 tasks
