Collect metrics on time spent processing repeat records (excluding waiting for endpoints) #34925
Conversation
We rate limit based on time spent waiting for an endpoint to respond, but would like to have more insight on how much time is being spent outside of waiting for the endpoint since that is theoretically within our control.
Pass buckets, bucket_tag, and bucket_unit
I do think it is worth creating one more upper bucket (5 seconds or something like that), since we have seen tasks that take more than a second but less than ten. It would be nice to get more insight there.
I don't think we care (at least for now) about differentiating between buckets of 10 and 30 ms. I added a 5 second bucket, and removed two of the lower buckets.
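The bucketing being discussed can be sketched as follows. This is an illustrative assumption, not the actual metrics API: `bucket_label`, the label format, and the bucket values are hypothetical stand-ins for the `buckets`/`bucket_tag`/`bucket_unit` arguments mentioned above.

```python
def bucket_label(value, buckets, unit):
    """Map a measured value onto a histogram bucket label.

    Hypothetical sketch: a value is tagged with the smallest bucket
    boundary it falls under, or an 'over_' label past the last bucket.
    """
    for upper in buckets:
        if value < upper:
            return f'lt_{upper:04d}{unit}'
    return f'over_{buckets[-1]:04d}{unit}'


# Per the comment above: drop the finest-grained low buckets and
# add a 5 second (5000 ms) upper bucket.
BUCKETS = (100, 1000, 5000)  # milliseconds
```

With these buckets, a 7-second task lands in the `over_5000ms` bucket rather than being lumped in with everything above 1 second.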
I missed that you needed to pass a timing context down the call stack in order to take advantage of nestable timers. Rather than pass a context down, let's just time the request in fire_for_record and pass that value back up.
Going to make a substantial change to this, so reviewers can hold off for now (looking at you @millerdev).
It is less disruptive to pass an optional timing context down the call stack since the caller can access the subtimer from the context it owns. The alternative requires changing the return signature of multiple functions, impacting callers that don't actually care about timing.
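The trade-off described above can be sketched with a toy nestable timer. `TimingContext`, `fire_for_record`, and `send_to_endpoint` here are simplified stand-ins I made up for illustration, not the real corehq implementations: the caller owns the context, the callee only uses it if one was passed, and callers that don't care about timing are unchanged.

```python
import time


class TimingContext:
    """Minimal sketch of a nestable timer (hypothetical)."""

    def __init__(self, name='root'):
        self.name = name
        self.duration = None
        self.subs = []

    def nested(self, name):
        # Create a child timer the owner of this context can read later.
        sub = TimingContext(name)
        self.subs.append(sub)
        return sub

    def __enter__(self):
        self._start = time.monotonic()
        return self

    def __exit__(self, *exc):
        self.duration = time.monotonic() - self._start


def send_to_endpoint(record):
    time.sleep(0.01)  # stand-in for the HTTP request


def fire_for_record(record, timing_context=None):
    # Optional context: callers that don't pass one see no signature
    # change and no timing overhead.
    if timing_context is not None:
        with timing_context.nested('endpoint'):
            send_to_endpoint(record)
    else:
        send_to_endpoint(record)


# The caller owns the context, so it can access the subtimer afterwards
# without any function having to return timing values up the stack:
with TimingContext('fire') as timer:
    fire_for_record('some-record', timing_context=timer)

endpoint_wait = [sub.duration for sub in timer.subs if sub.name == 'endpoint'][0]
```

The alternative (returning the duration from `fire_for_record`) would change the return signature of every function between the timer and the caller.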
@millerdev and @dannyroberts this is ready to review now.
corehq/motech/repeaters/tasks.py
Outdated
```python
# round up to the nearest millisecond, meaning always at least 1ms
report_repeater_usage(repeat_record.domain, milliseconds=int(fire_timer.duration * 1000) + 1)
action = 'attempted'
request_duration = [sub.duration for sub in fire_timer.root.subs if sub.name == ENDPOINT_TIMER][0]
```
Is there any chance that the list comprehension would result in an empty list? I think that would cause an error. Here's a suggestion to avoid that:
```diff
-request_duration = [sub.duration for sub in fire_timer.root.subs if sub.name == ENDPOINT_TIMER][0]
+request_duration = sum(sub.duration for sub in fire_timer.root.subs if sub.name == ENDPOINT_TIMER)
```
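The two forms fail differently when no matching subtimer was recorded, which is what the rest of the thread turns on. A small sketch of the difference:

```python
durations = []  # suppose no subtimer matched ENDPOINT_TIMER

# sum() degrades silently: an empty iterable sums to 0, so missing
# timing data would look like a zero-length endpoint request.
silent_total = sum(durations)

# Indexing [0] fails loudly, effectively asserting that the
# subtimer was actually recorded.
try:
    durations[0]
    failed_loudly = False
except IndexError:
    failed_loudly = True
```

So the suggestion avoids an exception, at the cost of masking a missing timer as a zero duration.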
Yeah, this was my least favorite part of the changes here, since it feels like there should be a nicer way to get sub timers. Given that we pass the timing context into repeat_record.fire right above, I felt it should be safe to assume there will always be a subtimer available. Otherwise, if the timing code in fire is removed, this should fail loudly so that we remove it here as well. If it handled the lack of a timer gracefully, we might end up collecting data that we think is telling us one thing, but that in fact just doesn't include timing the endpoint request at all. What do you think?
I see, so this code structure is functioning partly as an assertion. Seems reasonable as long as there is no way for it to fail without a bug in the code.
Product Description
Technical Summary
https://dimagi.atlassian.net/browse/SAAS-15798
We recently implemented rate limiting on repeaters based on time spent waiting for an endpoint, but were surprised to see slow timings even for rate-limited requests. This should give us more insight into the time spent processing repeat records on our side, subtracting the time spent waiting for an endpoint to respond when the record is an attempted forward.
If we leave these metrics in place longer term, we likely do not need to collect the current metrics around rate limited repeat records since this also tracks that, but I wanted to leave room to iterate on this before removing any existing metrics.
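The calculation described above amounts to a subtraction; the numbers here are purely illustrative.

```python
fire_duration = 1.25   # total time spent firing the repeat record (seconds)
endpoint_wait = 0.98   # time spent waiting on the remote endpoint (seconds)

# Time spent processing on our side, excluding the endpoint wait,
# which is the part that is theoretically within our control.
processing_time = fire_duration - endpoint_wait
```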
Feature Flag
Safety Assurance
Safety story
Automated test coverage
QA Plan
No
Rollback instructions
Labels & Review