
Collect metrics on time spent processing repeat records (excluding waiting for endpoints) #34925

Merged
12 commits merged into master on Aug 15, 2024

Conversation

@gherceg (Contributor) commented Jul 31, 2024

Product Description

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-15798

We recently implemented rate limiting on repeaters based on time spent waiting for an endpoint, but were surprised to see slower timings even for rate-limited requests. This should give us more insight into the time spent processing repeat records on our side, subtracting the time spent waiting for an endpoint to respond if it is an attempted forward.

If we leave these metrics in place longer term, we likely do not need to collect the current metrics around rate limited repeat records since this also tracks that, but I wanted to leave room to iterate on this before removing any existing metrics.
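The measurement described above, total processing time minus time spent waiting on the endpoint, can be sketched roughly like this. The function and field names here are illustrative, not the actual commcare-hq helpers:

```python
import time


def process_repeat_record(record, send_request):
    """Toy sketch: measure total task duration, then subtract the
    portion spent waiting for the remote endpoint to respond."""
    start = time.monotonic()
    endpoint_wait = 0.0
    if record.get("should_attempt"):
        request_start = time.monotonic()
        send_request(record)  # time spent waiting on the endpoint
        endpoint_wait = time.monotonic() - request_start
    total = time.monotonic() - start
    # time spent on "our side", excluding the endpoint wait
    processing = total - endpoint_wait
    return processing, endpoint_wait


# usage: a stand-in request that sleeps to simulate endpoint latency
proc, wait = process_repeat_record(
    {"should_attempt": True}, lambda r: time.sleep(0.01))
```

Since the total duration always contains the endpoint wait as a sub-interval, the subtraction is non-negative by construction.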

Feature Flag

Safety Assurance

Safety story

Automated test coverage

QA Plan

No

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

gherceg added 2 commits July 31, 2024 12:11
We rate limit based on time spent waiting for an endpoint to respond,
but would like to have more insight on how much time is being spent
outside of waiting for the endpoint since that is theoretically within
our control.
@gherceg gherceg changed the title Collect timings on time spent processing repeat records (excluding waiting for endpoints) Collect metrics on time spent processing repeat records (excluding waiting for endpoints) Jul 31, 2024
@gherceg (Contributor, Author) commented Jul 31, 2024

I do think it is worth creating one more upper bucket (5 seconds or something like that), since we have seen tasks that are greater than a second and less than ten. It would be nice to get more insight there.

I don't think we care (at least for now) about differentiating between
buckets of 10 and 30 ms. I added a 5 second bucket, and removed two of
the lower buckets.
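The bucket change discussed above (dropping the finer low-end buckets, adding a 5-second one) amounts to choosing histogram boundaries. A minimal bucketing helper, with boundaries that are illustrative rather than the exact values in the PR:

```python
import bisect

# illustrative bucket boundaries in seconds; durations above the last
# boundary fall into an implicit "over" bucket
BUCKETS = [0.1, 0.5, 1.0, 5.0]


def bucket_for(duration):
    """Return the label of the smallest bucket containing duration."""
    i = bisect.bisect_right(BUCKETS, duration)
    if i == len(BUCKETS):
        return f"over_{BUCKETS[-1]}s"
    return f"lt_{BUCKETS[i]}s"
```

With these boundaries, a 2-second task lands in the `lt_5.0s` bucket, which is exactly the visibility the comment above is asking for between one and ten seconds.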
@gherceg gherceg marked this pull request as ready for review July 31, 2024 18:31
@gherceg gherceg requested a review from kaapstorm as a code owner July 31, 2024 18:31
I missed that you needed to pass a timing context down the call stack in
order to take advantage of nestable timers. Rather than pass a context
down, let's just time the request in fire_for_record and pass that value
back up.
@gherceg (Contributor, Author) commented Aug 1, 2024

Going to make a substantial change to this, so reviewers can hold off for now (looking at you @millerdev).

gherceg added 4 commits August 1, 2024 13:51
It is less disruptive to pass an optional timing context down the call
stack since the caller can access the subtimer from the context it owns.
The alternative requires changing the return signature of multiple
functions, impacting callers that don't actually care about timing.
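The design choice in the commit message above, passing an optional timing context down the call stack so the caller can read subtimer durations from a context it owns, can be sketched as follows. The class and function names are illustrative, not the actual commcare-hq API:

```python
import time
from contextlib import contextmanager


class TimingContext:
    """Minimal nestable timer: the caller owns the context and reads
    subtimer durations after the call returns."""

    def __init__(self, name):
        self.name = name
        self.duration = None
        self.subs = []

    @contextmanager
    def nested(self, name):
        sub = TimingContext(name)
        self.subs.append(sub)
        start = time.monotonic()
        try:
            yield sub
        finally:
            sub.duration = time.monotonic() - start


def fire_for_record(record, timing_context=None):
    # callers that don't care about timing simply pass nothing,
    # so no return signature has to change
    if timing_context is not None:
        with timing_context.nested("endpoint"):
            time.sleep(0.005)  # stand-in for the outbound request
    else:
        time.sleep(0.005)
    return "attempted"


# usage: the caller owns the context and inspects the subtimer afterwards
timer = TimingContext("fire")
fire_for_record({}, timing_context=timer)
endpoint = [s for s in timer.subs if s.name == "endpoint"][0]
```

The alternative the commit message rejects would have `fire_for_record` return `("attempted", request_duration)`, forcing every caller to unpack a tuple whether or not it cares about timing.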
@gherceg (Contributor, Author) commented Aug 5, 2024

@millerdev and @dannyroberts this is ready to review now.

# round up to the nearest millisecond, meaning always at least 1ms
report_repeater_usage(repeat_record.domain, milliseconds=int(fire_timer.duration * 1000) + 1)
action = 'attempted'
request_duration = [sub.duration for sub in fire_timer.root.subs if sub.name == ENDPOINT_TIMER][0]
Contributor:

Is there any chance that the list comprehension would result in an empty list? I think that would cause an error. Here's a suggestion to avoid that:

Suggested change
request_duration = [sub.duration for sub in fire_timer.root.subs if sub.name == ENDPOINT_TIMER][0]
request_duration = sum(sub.duration for sub in fire_timer.root.subs if sub.name == ENDPOINT_TIMER)
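The difference between the two lines is their behavior when no matching subtimer exists: indexing `[0]` on an empty list comprehension raises `IndexError`, while `sum` over an empty generator returns 0. A minimal demonstration:

```python
# suppose no subtimer matching ENDPOINT_TIMER was recorded
durations = []

# sum() degrades gracefully to 0 when nothing matches
total = sum(d for d in durations)

# indexing the first element of an empty comprehension raises
try:
    first = [d for d in durations][0]
except IndexError:
    first = None
```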

Contributor Author:

Yeah, this was my least favorite part of the changes here, since it feels like there should be a nicer way to get sub timers. Given we pass the timing context into repeat_record.fire right above, I felt it should be safe to assume there will always be a subtimer available. Otherwise, if the timing code in fire is removed, this should fail loudly so that we remove it here as well. If it handled the lack of a timer gracefully, we might end up collecting data that we think is telling us one thing but in fact doesn't include timing the endpoint request at all. What do you think?

Contributor:

I see, so this code structure is functioning partly as an assertion. Seems reasonable as long as there is no way for it to fail without a bug in the code.
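The "fail loudly" behavior the discussion settles on can also be written with the built-in `next()`: omitting the default raises `StopIteration` when no subtimer matches, while supplying a default would silently mask a missing timer. A small sketch, with a stand-in `Sub` class mirroring the snippet above:

```python
class Sub:
    """Stand-in for a subtimer with a name and a duration."""

    def __init__(self, name, duration):
        self.name = name
        self.duration = duration


ENDPOINT_TIMER = "endpoint"
subs = [Sub("setup", 0.01), Sub(ENDPOINT_TIMER, 0.25)]

# fails loudly: raises StopIteration if no endpoint subtimer exists,
# functioning as the assertion described in the review thread
request_duration = next(
    s.duration for s in subs if s.name == ENDPOINT_TIMER)

# by contrast, a default value would silently mask the missing timer
missing = next(
    (s.duration for s in [] if s.name == ENDPOINT_TIMER), 0.0)
```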

@gherceg gherceg merged commit 9f78e42 into master Aug 15, 2024
13 checks passed
@gherceg gherceg deleted the gh/repeaters/improve-timing-metrics branch August 15, 2024 03:33