-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add client request size metric channel #2023
Conversation
Generate changelog in
|
private final EndpointChannel delegate; | ||
private final Histogram requestSize; | ||
|
||
static EndpointChannel create(Config cf, EndpointChannel channel, Endpoint endpoint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to wire this up from DialogueChannel
so that it's part of the chain of channels used to send requests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! wasn't sure if we'd want to do that separately or not, and also wasn't sure where in the chain to add this... maybe alongside the timing endpoint channel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the timing endpoint channel is in a conditional to protect against an edge case where clients may arbitrarily build new Endpoint
objects (if they're not using standard dialogue generated bindings). I don't want to miss out on that data, so perhaps we can avoid fanout by setting a relatively high threshold for reporting data, something like 1mb+.
We'll need to restructure a bit of the code to avoid creating a histogram until we've seen a large request for the first time (which is impossible-ish on GET endpoints, and many PUTs/POSTs will never reach it) by creating a memoized supplier for the histogram rather than passing the histogram directly.
Setting a threshold improves the probability of capturing maximums as well, since we use a sampling reservoir, we only hold ~1024 samples at a given time (with heavy recency bias), so reducing small and uninteresting samples will increase the probability we capture outliers.
@@ -56,6 +56,10 @@ namespaces: | |||
type: timer | |||
tags: [ channel-name ] | |||
docs: Time spent waiting in the sticky queue before execution attempt. | |||
requests.size: | |||
type: histogram | |||
tags: [service-name, endpoint] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's include channel-name
which is roughly equivalent to the service name in a service discovery block (this shouldn't increase cardinality)
if (body.isEmpty()) { | ||
// No need to record empty bodies | ||
return request; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this will record metrics for both requests which can and cannot be retried, without distinction. Should we tag metrics based on whether or not retires are allowed? (we could hold two histograms and pass the correct one to RequestSizeRecordingRequestBody
based on repeatable()
).
dialogue-core/src/main/java/com/palantir/dialogue/core/RequestSizeMetricsChannel.java
Outdated
Show resolved
Hide resolved
dialogue-core/src/main/java/com/palantir/dialogue/core/RequestSizeMetricsChannel.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
@Override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to override and pass through the default
interface methods from RequestBody
(some tests are failing due to the missing Content-Length
header)
dialogue-core/src/test/java/com/palantir/dialogue/core/RequestSizeMetricsChannelTest.java
Outdated
Show resolved
Hide resolved
dialogue-core/src/test/java/com/palantir/dialogue/core/RequestSizeMetricsChannelTest.java
Outdated
Show resolved
Hide resolved
dialogue-core/src/test/java/com/palantir/dialogue/core/RequestSizeMetricsChannelTest.java
Show resolved
Hide resolved
dialogue-core/src/test/java/com/palantir/dialogue/core/RequestSizeMetricsChannelTest.java
Outdated
Show resolved
Hide resolved
dialogue-core/src/test/java/com/palantir/dialogue/core/RequestSizeMetricsChannelTest.java
Outdated
Show resolved
Hide resolved
dialogue-core/src/test/java/com/palantir/dialogue/core/RequestSizeMetricsChannelTest.java
Outdated
Show resolved
Hide resolved
dialogue-core/src/test/java/com/palantir/dialogue/core/RequestSizeMetricsChannelTest.java
Outdated
Show resolved
Hide resolved
dialogue-clients/metrics.md
Outdated
@@ -71,6 +71,11 @@ Dialogue-specific metrics that are not necessarily applicable to other client im | |||
- `dialogue.client.request.queued.time` tagged `channel-name` (timer): Time spent waiting in the queue before execution. | |||
- `dialogue.client.request.endpoint.queued.time` tagged `channel-name`, `service-name`, `endpoint` (timer): Time spent waiting in the queue before execution on a specific endpoint due to server QoS. | |||
- `dialogue.client.request.sticky.queued.time` tagged `channel-name` (timer): Time spent waiting in the sticky queue before execution attempt. | |||
- `dialogue.client.requests.size` (histogram): Size of requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to describe the reporting size threshold in this metric description (though I suppose it doesn't claim to report the size of every request ;-)
Can update the the metric-schema/docs later if needs be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated! Yeah, good to call this out... might even be worth renaming the metric to large_requests
or something to explicitly call out that there's some distinction/threshold that's used for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current name gives us flexibility if we want to change thresholds later on, I don't mind it as is if you're happy with it. Thanks for updating!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff, thanks for the contribution! :-)
Released 3.94.0 |
This reverts commit 50b0bfd.
Before this PR
Currently, the best way to understand the size of requests made by a particular client is to analyze server request logs. While this does provide some high-level insight into the behavior of a client, it doesn't provide any information about the clients understanding of if a request is "repeatable" or not. This is important as not all retry approaches will have the same opinion on the repeatability of a request (e.g. envoy has a payload size constraint), so we need to know which large requests dialogue currently believes would get retried.
After this PR
==COMMIT_MSG==
Add a request size metric channel, which records the size of payloads written by the client.
==COMMIT_MSG==
Possible downsides?
This is currently as histogram, but I don't believe we need all of the metrics that would come out of it, and it's likely quite expensive to add this metric to all clients. I think we probably only care about the
p99
ormax
values here, as average request size is largely irrelevant.