
Collect metrics on time spent processing repeat records (excluding waiting for endpoints) #34925

Merged
12 commits merged into master on Aug 15, 2024

Conversation

@gherceg (Contributor) commented Jul 31, 2024

Product Description

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-15798

We recently implemented rate limiting on repeaters based on time spent waiting for an endpoint, but were surprised to see slower timings even for rate-limited requests. This should give us more insight into the time spent processing repeat records on our side, subtracting the time spent waiting for an endpoint to respond if it is an attempted forward.

If we leave these metrics in place longer term, we likely do not need to collect the current metrics around rate limited repeat records since this also tracks that, but I wanted to leave room to iterate on this before removing any existing metrics.
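The measurement described above, total processing time minus time spent waiting on the endpoint, can be sketched roughly like this. The function and field names here are illustrative, not the actual commcare-hq helpers:

```python
import time


def process_repeat_record(record, send_request):
    """Toy sketch: measure total task duration, then subtract the
    portion spent waiting for the remote endpoint to respond."""
    start = time.monotonic()
    endpoint_wait = 0.0
    if record.get("should_attempt"):
        request_start = time.monotonic()
        send_request(record)  # time spent waiting on the endpoint
        endpoint_wait = time.monotonic() - request_start
    total = time.monotonic() - start
    # time spent on "our side", excluding the endpoint wait
    processing = total - endpoint_wait
    return processing, endpoint_wait


# usage: a stand-in request that sleeps to simulate endpoint latency
proc, wait = process_repeat_record(
    {"should_attempt": True}, lambda r: time.sleep(0.01))
```

Since the total duration always contains the endpoint wait as a sub-interval, the subtraction is non-negative by construction.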

Feature Flag

Safety Assurance

Safety story

Automated test coverage

QA Plan

No

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

gherceg added 2 commits July 31, 2024 12:11
We rate limit based on time spent waiting for an endpoint to respond,
but would like to have more insight on how much time is being spent
outside of waiting for the endpoint since that is theoretically within
our control.
@gherceg gherceg changed the title Collect timings on time spent processing repeat records (excluding waiting for endpoints) Collect metrics on time spent processing repeat records (excluding waiting for endpoints) Jul 31, 2024
@gherceg (Contributor, Author) commented Jul 31, 2024

I do think it is worth creating one more upper bucket (5 seconds or something like that), since we have seen tasks that are greater than a second and less than ten. It would be nice to get more insight there.

I don't think we care (at least for now) about differentiating between
buckets of 10 and 30 ms. I added a 5 second bucket, and removed two of
the lower buckets.
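The bucket change discussed above (dropping the finer low-end buckets, adding a 5-second one) amounts to choosing histogram boundaries. A minimal bucketing helper, with boundaries that are illustrative rather than the exact values in the PR:

```python
import bisect

# illustrative bucket boundaries in seconds; durations above the last
# boundary fall into an implicit "over" bucket
BUCKETS = [0.1, 0.5, 1.0, 5.0]


def bucket_for(duration):
    """Return the label of the smallest bucket containing duration."""
    i = bisect.bisect_right(BUCKETS, duration)
    if i == len(BUCKETS):
        return f"over_{BUCKETS[-1]}s"
    return f"lt_{BUCKETS[i]}s"
```

With these boundaries, a 2-second task lands in the `lt_5.0s` bucket, which is exactly the visibility the comment above is asking for between one and ten seconds.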
@gherceg gherceg marked this pull request as ready for review July 31, 2024 18:31
@gherceg gherceg requested a review from kaapstorm as a code owner July 31, 2024 18:31
I missed that you needed to pass a timing context down the call stack in
order to take advantage of nestable timers. Rather than pass a context
down, let's just time the request in fire_for_record and pass that value
back up.
@gherceg (Contributor, Author) commented Aug 1, 2024

Going to make a substantial change to this, so reviewers can hold off for now (looking at you @millerdev).

gherceg added 4 commits August 1, 2024 13:51
It is less disruptive to pass an optional timing context down the call
stack since the caller can access the subtimer from the context it owns.
The alternative requires changing the return signature of multiple
functions, impacting callers that don't actually care about timing.
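The design choice in the commit message above, passing an optional timing context down the call stack so the caller can read subtimer durations from a context it owns, can be sketched as follows. The class and function names are illustrative, not the actual commcare-hq API:

```python
import time
from contextlib import contextmanager


class TimingContext:
    """Minimal nestable timer: the caller owns the context and reads
    subtimer durations after the call returns."""

    def __init__(self, name):
        self.name = name
        self.duration = None
        self.subs = []

    @contextmanager
    def nested(self, name):
        sub = TimingContext(name)
        self.subs.append(sub)
        start = time.monotonic()
        try:
            yield sub
        finally:
            sub.duration = time.monotonic() - start


def fire_for_record(record, timing_context=None):
    # callers that don't care about timing simply pass nothing,
    # so no return signature has to change
    if timing_context is not None:
        with timing_context.nested("endpoint"):
            time.sleep(0.005)  # stand-in for the outbound request
    else:
        time.sleep(0.005)
    return "attempted"


# usage: the caller owns the context and inspects the subtimer afterwards
timer = TimingContext("fire")
fire_for_record({}, timing_context=timer)
endpoint = [s for s in timer.subs if s.name == "endpoint"][0]
```

The alternative the commit message rejects would have `fire_for_record` return `("attempted", request_duration)`, forcing every caller to unpack a tuple whether or not it cares about timing.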
@gherceg (Contributor, Author) commented Aug 5, 2024

@millerdev and @dannyroberts this is ready to review now.

# round up to the nearest millisecond, meaning always at least 1ms
report_repeater_usage(repeat_record.domain, milliseconds=int(fire_timer.duration * 1000) + 1)
action = 'attempted'
request_duration = [sub.duration for sub in fire_timer.root.subs if sub.name == ENDPOINT_TIMER][0]
Contributor:

Is there any chance that the list comprehension would result in an empty list? I think that would cause an error. Here's a suggestion to avoid that:

Suggested change
request_duration = [sub.duration for sub in fire_timer.root.subs if sub.name == ENDPOINT_TIMER][0]
request_duration = sum(sub.duration for sub in fire_timer.root.subs if sub.name == ENDPOINT_TIMER)
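The difference between the two lines is their behavior when no matching subtimer exists: indexing `[0]` on an empty list comprehension raises `IndexError`, while `sum` over an empty generator returns 0. A minimal demonstration:

```python
# suppose no subtimer matching ENDPOINT_TIMER was recorded
durations = []

# sum() degrades gracefully to 0 when nothing matches
total = sum(d for d in durations)

# indexing the first element of an empty comprehension raises
try:
    first = [d for d in durations][0]
except IndexError:
    first = None
```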

Contributor Author:

Yeah, this was my least favorite part of the changes here, since it feels like there should be a nicer way to get sub timers. Given we pass the timing context into repeat_record.fire right above, I felt it should be safe to assume there will always be a subtimer available. Otherwise, if the timing code in fire is removed, this should fail loudly so that we remove it here as well. If it handled the lack of a timer gracefully, we might end up collecting data that we think is telling us one thing but in fact doesn't include timing the endpoint request at all. What do you think?

Contributor:

I see, so this code structure is functioning partly as an assertion. Seems reasonable as long as there is no way for it to fail without a bug in the code.
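The "fail loudly" behavior the discussion settles on can also be written with the built-in `next()`: omitting the default raises `StopIteration` when no subtimer matches, while supplying a default would silently mask a missing timer. A small sketch, with a stand-in `Sub` class mirroring the snippet above:

```python
class Sub:
    """Stand-in for a subtimer with a name and a duration."""

    def __init__(self, name, duration):
        self.name = name
        self.duration = duration


ENDPOINT_TIMER = "endpoint"
subs = [Sub("setup", 0.01), Sub(ENDPOINT_TIMER, 0.25)]

# fails loudly: raises StopIteration if no endpoint subtimer exists,
# functioning as the assertion described in the review thread
request_duration = next(
    s.duration for s in subs if s.name == ENDPOINT_TIMER)

# by contrast, a default value would silently mask the missing timer
missing = next(
    (s.duration for s in [] if s.name == ENDPOINT_TIMER), 0.0)
```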

@gherceg gherceg merged commit 9f78e42 into master Aug 15, 2024
13 checks passed
@gherceg gherceg deleted the gh/repeaters/improve-timing-metrics branch August 15, 2024 03:33