-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reworks the prometheus metrics to adhere to best practices #5174
Reworks the prometheus metrics to adhere to best practices #5174
Conversation
Thank you for your contribution! 🙏 We will review your PR as soon as possible. While you are waiting, make sure to:
Learn more about: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic job! Congrats!
I have kept some comments inline and answering your points:
- metrics are covered by e2e tests, so you have to update this file adding the new metrics
- you've already done all the things related with the deprecation policy (after adding the change to the description), next steps have to be done once this is merged (opening issues, discussion, etc)
updated the e2e tests, didn't add anything for testing for the deprecation notice, seems silly to me to rework. also made the needed changes in the webhook, as I only spotted those when going through the e2e tests. not able to run the e2e tests locally, fingers crossed this passes |
tests fail, but I don't think it's the part I was fiddling with. Also, the ending newline in the json - no idea what's the idea there, my VSCode removes the newline, and a vim-edit seems to add one, but then it's one too many ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix the DCO as well: https://github.com/kedacore/keda/pull/5174/checks?check_run_id=18936172744
config/grafana/keda-dashboard.json
Outdated
@@ -998,3 +998,4 @@ | |||
"version": 8, | |||
"weekStart": "" | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this empty line please
/run-e2e sequential |
Signed-off-by: Bernard Grymonpon <[email protected]> Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
cf81afc
to
49aaff2
Compare
fixed the DCO and the newline - seems to be all green on those. E2E is out of my league ;-) |
func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value float64) { | ||
otelScalerMetricsLatencyVal.val = value | ||
func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value time.Duration) { | ||
otelScalerMetricsLatencyVal.val = value.Seconds() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otelScalerMetricsLatencyVal.val = value.Seconds() | |
otelScalerMetricsLatencyVal.val = value.Milliseconds() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's been settled that KEDA will use Seconds
in opentelemetry metrics.
@@ -210,7 +211,7 @@ func (o *OtelMetrics) RecordScalableObjectLatency(namespace string, name string, | |||
attribute.Key("type").String(resourceType), | |||
attribute.Key("name").String(name)) | |||
|
|||
otelInternalLoopLatencyVal.val = value | |||
otelInternalLoopLatencyVal.val = value.Seconds() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otelInternalLoopLatencyVal.val = value.Seconds() | |
otelInternalLoopLatencyVal.val = value.Milliseconds() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's been settled that KEDA will use Seconds
in opentelemetry metrics.
prometheus.GaugeOpts{ | ||
Namespace: DefaultPromMetricsNamespace, | ||
Subsystem: "scaler", | ||
Name: "metrics_latency_seconds", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit is miliseconds, do we have to change the name from metrics_latency_seconds
to metrics_latency_miliseconds
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is preferred in prometheus to actually use the "base" unit (seconds, watts, etc): https://prometheus.io/docs/practices/naming/#base-units I'd suggest to change to the base unit, but this isn't my project... ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not against changing the scale (using another metric, ofc), but I'm afraid about missing information. Can we guarantee that we won't miss info? Totally ignorant question from my side but it's just for double check if Prometheus handles float numbers properly (I don't have any Prometheus server to check it right now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To elaborate further, it might actually be better to add to the Otel metrics the WithUnit
(https://pkg.go.dev/go.opentelemetry.io/otel/metric#WithUnit) call and specify the unit. There could be both s
or ms
defined (see https://ucum.org/ucum#section-Tables-of-Terminal-Symbols), and supply the value accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the spamming with comments, but dug a bit deeper in the otel docs, and on https://opentelemetry.io/docs/specs/semconv/general/metrics/#instrument-units, they state (last item in the list):
When instruments are measuring durations, seconds (i.e. s) SHOULD be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, it used to be "keda_triggers_total", but I renamed it in prom to "keda_triggers_handled_total". However, looking at the code, it's an up-down counter, so it's not the historical ever-increasing number of handled triggers. It's the amount of triggers which are currently viewed/defined/loaded/registered/observed... from the CRDs.
So, the "handled" is a bad choice, my mistake. Just naming it "keda_triggers_total" is ambiguous (imho), as it is unclear what exactly about the trigger is being counted (invocations, definitions, registrations...). Exactly the same thoughts about "keda_resources_total" btw.
As this whole PR is biased by myself already, I suggest to switch to the naming of triggers_registered_total
and resources_registered_total
for prometheus, and to triggers_registered_count
and resources_registered_count
in Otel. It clearly indicated the fact that the system read, validated and started to use the trigger/CRD. I'll prep the PR in that direction (d72b860).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zroubalik @tomkerkhove WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kindly reminder @zroubalik @tomkerkhove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kindly reminder @zroubalik @tomkerkhove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OMG, sorry, I totally missed this convo.
+1 from my for registered
prometheus.GaugeOpts{ | ||
Namespace: DefaultPromMetricsNamespace, | ||
Subsystem: "internal_scale_loop", | ||
Name: "latency_seconds", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's been settled that KEDA will use Seconds
in metrics.
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
(although it will need an approval either from @zroubalik or @JorTurFer)
func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value float64) { | ||
otelScalerMetricsLatencyVal.val = value | ||
func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value time.Duration) { | ||
otelScalerMetricsLatencyVal.val = value.Seconds() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's been settled that KEDA will use Seconds
in opentelemetry metrics.
@@ -210,7 +211,7 @@ func (o *OtelMetrics) RecordScalableObjectLatency(namespace string, name string, | |||
attribute.Key("type").String(resourceType), | |||
attribute.Key("name").String(name)) | |||
|
|||
otelInternalLoopLatencyVal.val = value | |||
otelInternalLoopLatencyVal.val = value.Seconds() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's been settled that KEDA will use Seconds
in opentelemetry metrics.
prometheus.GaugeOpts{ | ||
Namespace: DefaultPromMetricsNamespace, | ||
Subsystem: "internal_scale_loop", | ||
Name: "latency_seconds", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's been settled that KEDA will use Seconds
in metrics.
@@ -35,16 +36,16 @@ var ( | |||
prometheus.GaugeOpts{ | |||
Namespace: DefaultPromMetricsNamespace, | |||
Name: "build_info", | |||
Help: "A metric with a constant '1' value labeled by version, git_commit and goversion from which KEDA was built.", | |||
Help: "Info metric, with static information about KEDA build like: version, git commit and Golang runtime info.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: was the comma intentional?
Help: "Info metric, with static information about KEDA build like: version, git commit and Golang runtime info.", | |
Help: "Info metric with static information about KEDA build like: version, git commit and Golang runtime info.", |
hey @wonko, would you be available to address the merge conflicts? I can see these files are now unable to merge.
|
hey @wonko, do you think you will be able to find some time to resolve the conflicts? It's ok if not, I can help out to cherry-pick your work and follow up in a new PR. |
@wozniakjan Been busy with some other stuff lately, but I could pick this up either this week, or this weekend latest. |
no pressure @wonko, I'm available at your convenience. I just think this is really solid work and it would be great to get this merged in the upcoming months :) |
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
4778210
to
388be0f
Compare
@wozniakjan I think it's good now ... please have a good look at it, as I did the merge with a lot of interruptions ... |
/run-e2e sequential |
hey @wonko, the e2e tests seem to fail with metrics registration error
can you please take a look at it? The error location could mean it's related to the changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wozniakjan could you please check what is the problem and help us complete this PR in case @wonko doesn't have a capacity (which seems it is the case).
I would like to get this merged in 2.14 🙏
Sure thing, I can investigate tomorrow. |
#4854
I reworked the prometheus metric names to follow conventions: added units where needed, and appended
total
.Also passed a
time.Duration
down to the metrics handlers, as the conversion to units needs to happen when construction the metrics.No tests were present, none added.
No idea where to track deprecation actions (when and how will the deprecated metrics be removed?)
Checklist
Fixes #4854
Relates to kedacore/keda-docs#1258