Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

monitoring: tweak search request duration metric to record success / failure #507

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ggilmore
Copy link
Contributor

@ggilmore ggilmore commented Jan 3, 2023

This tweaks the search request duration metric to record whether or not the search itself was successful. This follows the advice from Google SRE's Monitoring Distributed Systems:

Latency

... It’s important to distinguish between the latency of successful requests and the latency of failed requests.

For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

@ggilmore ggilmore requested a review from a team January 3, 2023 18:39
@ggilmore ggilmore changed the title search: tweak search request duration metric to record success / failure monitoring: tweak search request duration metric to record success / failure Jan 3, 2023
Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick look as I'm trying to learn more about our observability features.

Help: "The duration a search request took in seconds",
Buckets: prometheus.DefBuckets, // DefBuckets good for service timings
})
metricSearchDuration = promauto.NewHistogramVec(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious how prometheus handles changes to metric types. Maybe because they're both histograms, it's not really a change and now there will just be labels?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtibshirani

In Prometheus' data model, time series are uniquely identified by their name (zoekt_search_duration_seconds) and any attached labels (success). In my interpretation, this means that adding a new label creates a new time series. When we write Grafana dasbhoards that query on zoekt_search_duration_seconds{success=true}, the old data points (from zoekt_search_duration_seconds observations that occurred before we added the success label ) won't be returned from the Prometheus server.

Does this answer your question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, thanks for the reference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants