Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reworks the prometheus metrics to adhere to best practices #5174

Closed

Conversation

wonko
Copy link
Contributor

@wonko wonko commented Nov 9, 2023

#4854

I reworked the prometheus metric names to follow conventions: added units where needed, and appended total.

Also passed a time.Duration down to the metrics handlers, as the conversion to units needs to happen when construction the metrics.

No tests were present, none added.

No idea where to track deprecation actions (when and how will the deprecated metrics be removed?)

Checklist

Fixes #4854

Relates to kedacore/keda-docs#1258

@wonko wonko requested a review from a team as a code owner November 9, 2023 16:21
Copy link

github-actions bot commented Nov 9, 2023

Thank you for your contribution! 🙏 We will review your PR as soon as possible.

While you are waiting, make sure to:

Learn more about:

Copy link
Member

@JorTurFer JorTurFer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic job! Congrats!
I have kept some comments inline and answering your points:

  • metrics are covered by e2e tests, so you have to update this file adding the new metrics
  • you've already done all the things related with the deprecation policy (after adding the change to the description), next steps have to be done once this is merged (opening issues, discussion, etc)

CHANGELOG.md Outdated Show resolved Hide resolved
pkg/metricscollector/prommetrics.go Outdated Show resolved Hide resolved
@wonko
Copy link
Contributor Author

wonko commented Nov 22, 2023

updated the e2e tests, didn't add anything for testing for the deprecation notice, seems silly to me to rework.

also made the needed changes in the webhook, as I only spotted those when going through the e2e tests.

not able to run the e2e tests locally, fingers crossed this passes

@wonko
Copy link
Contributor Author

wonko commented Nov 22, 2023

tests fail, but I don't think it's the part I was fiddling with. Also, the ending newline in the json - no idea what's the idea there, my VSCode removes the newline, and a vim-edit seems to add one, but then it's one too many ...

Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -998,3 +998,4 @@
"version": 8,
"weekStart": ""
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this empty line please

@JorTurFer
Copy link
Member

JorTurFer commented Nov 22, 2023

/run-e2e sequential
Update: You can check the progress here

Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
@wonko wonko force-pushed the feature/rework-prometheus-metric-names branch from cf81afc to 49aaff2 Compare November 23, 2023 09:27
@wonko
Copy link
Contributor Author

wonko commented Nov 23, 2023

fixed the DCO and the newline - seems to be all green on those. E2E is out of my league ;-)

func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value float64) {
otelScalerMetricsLatencyVal.val = value
func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value time.Duration) {
otelScalerMetricsLatencyVal.val = value.Seconds()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
otelScalerMetricsLatencyVal.val = value.Seconds()
otelScalerMetricsLatencyVal.val = value.Milliseconds()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's been settled that KEDA will use Seconds in opentelemetry metrics.

@@ -210,7 +211,7 @@ func (o *OtelMetrics) RecordScalableObjectLatency(namespace string, name string,
attribute.Key("type").String(resourceType),
attribute.Key("name").String(name))

otelInternalLoopLatencyVal.val = value
otelInternalLoopLatencyVal.val = value.Seconds()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
otelInternalLoopLatencyVal.val = value.Seconds()
otelInternalLoopLatencyVal.val = value.Milliseconds()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's been settled that KEDA will use Seconds in opentelemetry metrics.

prometheus.GaugeOpts{
Namespace: DefaultPromMetricsNamespace,
Subsystem: "scaler",
Name: "metrics_latency_seconds",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit is miliseconds, do we have to change the name from metrics_latency_seconds to metrics_latency_miliseconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is preferred in prometheus to actually use the "base" unit (seconds, watts, etc): https://prometheus.io/docs/practices/naming/#base-units I'd suggest to change to the base unit, but this isn't my project... ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against changing the scale (using another metric, ofc), but I'm afraid about missing information. Can we guarantee that we won't miss info? Totally ignorant question from my side but it's just for double check if Prometheus handles float numbers properly (I don't have any Prometheus server to check it right now)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate further, it might actually be better to add to the Otel metrics the WithUnit (https://pkg.go.dev/go.opentelemetry.io/otel/metric#WithUnit) call and specify the unit. There could be both s or ms defined (see https://ucum.org/ucum#section-Tables-of-Terminal-Symbols), and supply the value accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the spamming with comments, but dug a bit deeper in the otel docs, and on https://opentelemetry.io/docs/specs/semconv/general/metrics/#instrument-units, they state (last item in the list):

When instruments are measuring durations, seconds (i.e. s) SHOULD be used.

Copy link
Contributor Author

@wonko wonko Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, it used to be "keda_triggers_total", but I renamed it in prom to "keda_triggers_handled_total". However, looking at the code, it's an up-down counter, so it's not the historical ever-increasing number of handled triggers. It's the amount of triggers which are currently viewed/defined/loaded/registered/observed... from the CRDs.

So, the "handled" is a bad choice, my mistake. Just naming it "keda_triggers_total" is ambiguous (imho), as it is unclear what exactly about the trigger is being counted (invocations, definitions, registrations...). Exactly the same thoughts about "keda_resources_total" btw.

As this whole PR is biased by myself already, I suggest to switch to the naming of triggers_registered_total and resources_registered_total for prometheus, and to triggers_registered_count and resources_registered_count in Otel. It clearly indicated the fact that the system read, validated and started to use the trigger/CRD. I'll prep the PR in that direction (d72b860).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kindly reminder @zroubalik @tomkerkhove

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kindly reminder @zroubalik @tomkerkhove

Copy link
Member

@zroubalik zroubalik Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMG, sorry, I totally missed this convo.

+1 from my for registered

prometheus.GaugeOpts{
Namespace: DefaultPromMetricsNamespace,
Subsystem: "internal_scale_loop",
Name: "latency_seconds",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's been settled that KEDA will use Seconds in metrics.

Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Signed-off-by: Bernard Grymonpon <[email protected]>
Copy link
Member

@wozniakjan wozniakjan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

(although it will need an approval either from @zroubalik or @JorTurFer)

func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value float64) {
otelScalerMetricsLatencyVal.val = value
func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value time.Duration) {
otelScalerMetricsLatencyVal.val = value.Seconds()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's been settled that KEDA will use Seconds in opentelemetry metrics.

@@ -210,7 +211,7 @@ func (o *OtelMetrics) RecordScalableObjectLatency(namespace string, name string,
attribute.Key("type").String(resourceType),
attribute.Key("name").String(name))

otelInternalLoopLatencyVal.val = value
otelInternalLoopLatencyVal.val = value.Seconds()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's been settled that KEDA will use Seconds in opentelemetry metrics.

prometheus.GaugeOpts{
Namespace: DefaultPromMetricsNamespace,
Subsystem: "internal_scale_loop",
Name: "latency_seconds",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's been settled that KEDA will use Seconds in metrics.

@@ -35,16 +36,16 @@ var (
prometheus.GaugeOpts{
Namespace: DefaultPromMetricsNamespace,
Name: "build_info",
Help: "A metric with a constant '1' value labeled by version, git_commit and goversion from which KEDA was built.",
Help: "Info metric, with static information about KEDA build like: version, git commit and Golang runtime info.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: was the comma intentional?

Suggested change
Help: "Info metric, with static information about KEDA build like: version, git commit and Golang runtime info.",
Help: "Info metric with static information about KEDA build like: version, git commit and Golang runtime info.",

@wozniakjan
Copy link
Member

hey @wonko, would you be available to address the merge conflicts? I can see these files are now unable to merge.

        both modified:   CHANGELOG.md
        both modified:   pkg/metricscollector/metricscollectors.go
        both modified:   pkg/metricscollector/opentelemetry.go
        both modified:   pkg/metricscollector/prommetrics.go
        both modified:   pkg/scaling/scale_handler.go
        both modified:   tests/sequential/opentelemetry_metrics/opentelemetry_metrics_test.go
        both modified:   tests/sequential/prometheus_metrics/prometheus_metrics_test.go

@wozniakjan
Copy link
Member

hey @wonko, do you think you will be able to find some time to resolve the conflicts? It's ok if not, I can help out to cherry-pick your work and follow up in a new PR.

@wonko
Copy link
Contributor Author

wonko commented Jan 30, 2024

@wozniakjan Been busy with some other stuff lately, but I could pick this up either this week, or this weekend latest.

@wozniakjan
Copy link
Member

no pressure @wonko, I'm available at your convenience. I just think this is really solid work and it would be great to get this merged in the upcoming months :)

@wonko wonko force-pushed the feature/rework-prometheus-metric-names branch from 4778210 to 388be0f Compare February 5, 2024 14:12
@wonko
Copy link
Contributor Author

wonko commented Feb 5, 2024

no pressure @wonko, I'm available at your convenience. I just think this is really solid work and it would be great to get this merged in the upcoming months :)

@wozniakjan I think it's good now ... please have a good look at it, as I did the merge with a lot of interruptions ...

@zroubalik
Copy link
Member

zroubalik commented Feb 12, 2024

/run-e2e sequential
Update: You can check the progress here

@wozniakjan
Copy link
Member

hey @wonko, the e2e tests seem to fail with metrics registration error

2024/02/12 22:24:44 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
panic: a previously registered descriptor with the same fully-qualified name as Desc{fqName: "keda_scaler_errors_total", help: "The total number of errors
encountered for each scaler.", constLabels: {}, variableLabels: {namespace,metric,scaledObject,scaler,triggerIndex,type}} has different label names or a different
help string

goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0x172ccd8?, {0xc000084e80?, 0x1, 0xc000c9f5d0?})
    /workspace/vendor/github.com/prometheus/client_golang/prometheus/registry.go:405 +0x78
github.com/kedacore/keda/v2/pkg/metricscollector.NewPromMetrics()
    /workspace/pkg/metricscollector/prommetrics.go:233 +0x30e
github.com/kedacore/keda/v2/pkg/metricscollector.NewMetricsCollectors(0xd?, 0x1)
    /workspace/pkg/metricscollector/metricscollectors.go:79 +0x25
main.main()
    /workspace/cmd/operator/main.go:148 +0xb3f

can you please take a look at it? The error location could mean it's related to the changes.

Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wozniakjan could you please check what is the problem and help us complete this PR in case @wonko doesn't have a capacity (which seems it is the case).
I would like to get this merged in 2.14 🙏

@wozniakjan
Copy link
Member

Sure thing, I can investigate tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Standardize prometheus naming convention for keda metrics
4 participants