feat(aci): enqueue workflows for delayed processing #83548

Merged
cathteng merged 11 commits into master from cathy/aci/enqueue-workflows on Jan 21, 2025

Member

@cathteng cathteng commented Jan 15, 2025

Adds the following logic to account for delayed processing of slow conditions:

  1. Process the fast conditions in the WHEN DataConditionGroup (DCG), and note the workflows whose slow condition(s) must be checked before proceeding.
  2. For those workflows, evaluate all of their IF DCGs to determine which ones would fire if the slow condition(s) pass.
  3. For workflows that need their slow conditions checked and have passing IF DCGs, enqueue them in the buffer for delayed processing. We collect the passing IF DCGs so that we know which actions to fire (via DataConditionGroupAction) if the slow conditions pass. This way, delayed processing only has to evaluate the slow conditions.
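
The three steps above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Workflow`, `when_fast`, `has_slow_conditions`, `if_groups`, and the dict-based `buffer` are all hypothetical stand-ins for the real models and buffer backend.

```python
from dataclasses import dataclass, field
from typing import Callable

Event = dict[str, int]  # hypothetical stand-in for the real job/event payload
Predicate = Callable[[Event], bool]


@dataclass
class Workflow:
    id: int
    when_fast: Predicate        # True if the WHEN DCG passes on fast conditions alone
    has_slow_conditions: bool   # True if the WHEN DCG also contains slow conditions
    if_groups: dict[int, Predicate] = field(default_factory=dict)  # IF DCG id -> check


def process_workflows(
    workflows: list[Workflow], event: Event, buffer: dict[int, list[int]]
) -> list[int]:
    """Return ids of workflows triggered now; enqueue those that may fire later."""
    triggered: list[int] = []
    for wf in workflows:
        # Step 1: evaluate fast conditions only.
        if wf.when_fast(event):
            triggered.append(wf.id)
            continue
        if not wf.has_slow_conditions:
            continue  # failed outright, nothing left to check
        # Step 2: find the IF DCGs that would fire if the slow conditions pass.
        passing = [gid for gid, check in wf.if_groups.items() if check(event)]
        # Step 3: enqueue only when there is something that could fire later.
        if passing:
            buffer[wf.id] = passing
    return triggered
```

In this sketch, a workflow with slow conditions but no passing IF DCGs is dropped rather than enqueued, since no action could fire even if its slow conditions pass.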

@github-actions bot added the Scope: Backend label (automatically applied to PRs that change backend components) Jan 15, 2025

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 98.18182% with 2 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/sentry/workflow_engine/models/workflow.py | 83.33% | 1 Missing ⚠️ |
| src/sentry/workflow_engine/processors/workflow.py | 96.55% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #83548      +/-   ##
==========================================
+ Coverage   87.54%   87.60%   +0.05%     
==========================================
  Files        9408     9489      +81     
  Lines      537825   539196    +1371     
  Branches    21176    21176              
==========================================
+ Hits       470859   472370    +1511     
+ Misses      66618    66478     -140     
  Partials      348      348              

@cathteng cathteng marked this pull request as ready for review January 16, 2025 15:43
@cathteng cathteng requested a review from a team as a code owner January 16, 2025 15:43
Member

@ceorourke ceorourke left a comment

overall lgtm

@cathteng cathteng requested review from a team as code owners January 16, 2025 22:38
Contributor

@saponifi3d saponifi3d left a comment

i think the biggest change here is that we should not be filtering the workflows when we are trying to evaluate them. instead, we should filter before invoking evaluate. by filtering inside of the evaluation, it means we wouldn't be able to re-use this evaluation method in slow processing.

src/sentry/workflow_engine/processors/workflow.py (outdated)
Comment on lines 49 to 53
if random.random() < 0.01:
    logger.info(
        "process_workflows.workflow_enqueued",
        extra={"workflow": workflow.id, "group": event.group.id, "project": project_id},
    )
Contributor

can we audit the logging in here? I'm not sure we need a lot of these info logs anymore (this one for example is being sampled to 1% - which isn't super valuable)

Member Author

@mifu67 are the logs you've downsampled to 1% for enqueue rules for delayed processing still useful?

Contributor

she just did that cause they were noisy, i think there are a number of info logs here that we need to audit.

I'd recommend only keeping things that make sense to you. if you have any questions lemme know :)

src/sentry/workflow_engine/processors/workflow.py (outdated)
Comment on lines 74 to 88
 def evaluate_workflow_triggers(
     workflows: set[Workflow], job: WorkflowJob
 ) -> tuple[set[Workflow], set[Workflow]]:
     triggered_workflows: set[Workflow] = set()
     workflows_to_enqueue: set[Workflow] = set()

     for workflow in workflows:
         if workflow.evaluate_trigger_conditions(job):
             triggered_workflows.add(workflow)
         else:
             if get_slow_conditions(workflow):
                 # enqueue to be evaluated later
                 workflows_to_enqueue.add(workflow)

-    return triggered_workflows
+    return triggered_workflows, workflows_to_enqueue
Contributor

i don't think we should be doing filtering or anything here - this method should be a pure method to evaluate the workflow triggers and that's it; if we want to filter the workflows being evaluated we should do that before evaluating them.

let's update the code to have process_workflows figure out which conditions are fast / slow, then filter the workflows based on that.

Member Author

we only enqueue workflows that need to have slow conditions evaluated because they don't pass after evaluating the fast conditions alone. are you saying to evaluate the workflows with slow conditions separately? some of them might be triggered immediately and some of them may have to be enqueued

Member Author

updated to only return triggered_workflows
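
One way to read the outcome of this thread, sketched with hypothetical names: the evaluation function stays pure and returns only the triggered workflows, while the caller decides what to enqueue. Workflows are plain strings here, and `passes` and `has_slow` are assumed callables standing in for `evaluate_trigger_conditions` and a slow-condition lookup.

```python
from typing import Callable


def evaluate_triggers(workflows: set[str], passes: Callable[[str], bool]) -> set[str]:
    """Pure evaluation: return the workflows whose triggers pass, nothing else."""
    return {w for w in workflows if passes(w)}


def split_for_processing(
    workflows: set[str],
    passes: Callable[[str], bool],
    has_slow: Callable[[str], bool],
) -> tuple[set[str], set[str]]:
    """Caller-side logic: decide what fires now and what is enqueued for later."""
    triggered = evaluate_triggers(workflows, passes)
    # Only workflows that did not trigger and still have slow conditions are enqueued.
    to_enqueue = {w for w in workflows - triggered if has_slow(w)}
    return triggered, to_enqueue
```

Because `evaluate_triggers` has no enqueue side effects, the same function could in principle be called again from delayed processing once the slow-condition data is available.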

…ate slow conditions when the data is available
@@ -59,7 +59,7 @@ def get_result(model: TSDBModel, group_ids: list[int]) -> dict[int, int]:


@condition_handler_registry.register(Condition.EVENT_FREQUENCY_COUNT)
-class EventFrequencyCountHandler(EventFrequencyConditionHandler, DataConditionHandler[int]):
+class EventFrequencyCountHandler(EventFrequencyConditionHandler, DataConditionHandler[WorkflowJob]):
Member Author

update this to evaluate WorkflowJob so we can reuse evaluate_workflow_triggers in delayed processing, and populate snuba_results inside WorkflowJob after we make the snuba queries

-    def evaluate_value(value: list[int], comparison: Any) -> DataConditionResult:
-        if len(value) != 2:
+    def evaluate_value(value: WorkflowJob, comparison: Any) -> DataConditionResult:
+        if len(value.get("snuba_results", [])) != 2:
Member

is this a common scenario or a weird snuba blip?

Member Author

it's possible we don't have the snuba results when we are evaluating the triggers outside of delayed processing
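
A minimal sketch of the guard discussed here, assuming the handler simply reports "not passing" when the Snuba data has not been populated yet. The `WorkflowJob` below is a cut-down hypothetical version with only the new key, and the comparison (current count minus baseline against a threshold) is a placeholder, not the handler's real logic.

```python
from typing import TypedDict


class WorkflowJob(TypedDict, total=False):
    # Populated only after the Snuba queries run in delayed processing.
    snuba_results: list[int]


def evaluate_value(job: WorkflowJob, threshold: int) -> bool:
    results = job.get("snuba_results", [])
    if len(results) != 2:
        # Outside delayed processing the key is usually absent, so the
        # slow condition cannot pass yet.
        return False
    current, baseline = results
    # Placeholder comparison, assumed for illustration only.
    return current - baseline > threshold
```

With this shape, the same call works in both paths: it returns False during normal processing and gives a real answer once delayed processing has filled in `snuba_results`.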

buffer.backend.push_to_sorted_set(key=WORKFLOW_ENGINE_BUFFER_LIST_KEY, value=project_id)

if_dcgs = workflow_action_groups.get(workflow.id, [])
if not if_dcgs:
Member

this reads a little strange - why is the var name if_dcgs? could it just be dcgs?

Member Author

these are IF data condition groups

Contributor

🤔 maybe we call them workflow_action_filters? (since that should be the type of DCG here)

Contributor

@saponifi3d saponifi3d left a comment

overall, lgtm. i think we can do a bit more cleanup here, but i don't think we need to block on that.

🙏 thanks for addressing the feedback!

# enqueue to be evaluated later
workflows_to_enqueue.add(workflow)

enqueue_workflows(workflows_to_enqueue, job)
Contributor

nit: save the couple of cpu cycles and only enqueue if we have something to enqueue

Suggested change
-enqueue_workflows(workflows_to_enqueue, job)
+if workflows_to_enqueue:
+    enqueue_workflows(workflows_to_enqueue, job)

@@ -40,6 +40,7 @@ class WorkflowJob(EventJob, total=False):
     has_alert: bool
     has_escalated: bool
     workflow: Workflow
+    snuba_results: list[int]
Contributor

🤔 is the value in this that we can re-use evaluate_workflows method?

i think this is a bit of a smell that the abstraction might not be quite right in either delayed processing or the evaluate_workflow_triggers 🤔

mind adding a TODO here so i can come back and take a look? i'm not sure if this is the best approach, but seems okay for now.

Member Author

yeah the value is that we can reuse at least the evaluate_workflow_triggers function. we'll already have processed the actions we can possibly fire before enqueuing, so all we need to do is process the slow conditions, but i'm also not sure if it's the best way to do it

@cathteng cathteng merged commit 3caa22e into master Jan 21, 2025
49 checks passed
@cathteng cathteng deleted the cathy/aci/enqueue-workflows branch January 21, 2025 18:02