Reworks the prometheus metrics to adhere to best practices #5174

wonko · 2023-11-09T16:21:56Z

I reworked the prometheus metric names to follow conventions: added units where needed, and appended total.

Also passed a time.Duration down to the metrics handlers, as the conversion to units needs to happen when constructing the metrics.

No tests were present, none added.

No idea where to track deprecation actions (when and how will the deprecated metrics be removed?)

Checklist

I have verified that my change is according to the deprecations & breaking changes policy
Tests have been added
Changelog has been updated and is aligned with our changelog requirements
Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #4854

Relates to kedacore/keda-docs#1258

github-actions · 2023-11-09T16:22:09Z

Thank you for your contribution! 🙏 We will review your PR as soon as possible.

While you are waiting, make sure to:

Add an entry in our changelog in alphabetical order and link related issue
Update the documentation, if needed
Add unit & e2e tests for your changes
GitHub checks are passing
Is the DCO check failing? Here is how you can fix DCO issues

Learn more about:

Our contribution guide

JorTurFer

Fantastic job! Congrats!
I have left some comments inline. Answering your points:

Metrics are covered by e2e tests, so you have to update this file, adding the new metrics
You've already done everything related to the deprecation policy (after adding the change to the description); the next steps (opening issues, discussion, etc.) have to be done once this is merged

CHANGELOG.md

pkg/metricscollector/prommetrics.go

wonko · 2023-11-22T15:07:14Z

updated the e2e tests, didn't add anything for testing for the deprecation notice, seems silly to me to rework.

also made the needed changes in the webhook, as I only spotted those when going through the e2e tests.

not able to run the e2e tests locally, fingers crossed this passes

wonko · 2023-11-22T15:33:44Z

tests fail, but I don't think it's the part I was fiddling with. Also, the ending newline in the json - no idea what's the idea there, my VSCode removes the newline, and a vim-edit seems to add one, but then it's one too many ...

zroubalik

Please fix the DCO as well: https://github.com/kedacore/keda/pull/5174/checks?check_run_id=18936172744

zroubalik · 2023-11-22T16:28:03Z

config/grafana/keda-dashboard.json

@@ -998,3 +998,4 @@
  "version": 8,
  "weekStart": ""
 }
+


remove this empty line please

JorTurFer · 2023-11-22T16:39:24Z

/run-e2e sequential
Update: You can check the progress here

wonko · 2023-11-23T10:08:06Z

fixed the DCO and the newline - seems to be all green on those. E2E is out of my league ;-)

JorTurFer · 2023-11-24T13:18:11Z

pkg/metricscollector/opentelemetry.go

-func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value float64) {
-	otelScalerMetricsLatencyVal.val = value
+func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value time.Duration) {
+	otelScalerMetricsLatencyVal.val = value.Seconds()


Suggested change

-	otelScalerMetricsLatencyVal.val = value.Seconds()
+	otelScalerMetricsLatencyVal.val = value.Milliseconds()

I guess it's been settled that KEDA will use Seconds in opentelemetry metrics.

JorTurFer · 2023-11-24T13:18:19Z

pkg/metricscollector/opentelemetry.go

@@ -210,7 +211,7 @@ func (o *OtelMetrics) RecordScalableObjectLatency(namespace string, name string,
 		attribute.Key("type").String(resourceType),
 		attribute.Key("name").String(name))

-	otelInternalLoopLatencyVal.val = value
+	otelInternalLoopLatencyVal.val = value.Seconds()


Suggested change

-	otelInternalLoopLatencyVal.val = value.Seconds()
+	otelInternalLoopLatencyVal.val = value.Milliseconds()

I guess it's been settled that KEDA will use Seconds in opentelemetry metrics.

JorTurFer · 2023-11-24T13:19:44Z

pkg/metricscollector/prommetrics.go

+		prometheus.GaugeOpts{
+			Namespace: DefaultPromMetricsNamespace,
+			Subsystem: "scaler",
+			Name:      "metrics_latency_seconds",


The unit is milliseconds, do we have to change the name from metrics_latency_seconds to metrics_latency_milliseconds?

it is preferred in prometheus to actually use the "base" unit (seconds, watts, etc): https://prometheus.io/docs/practices/naming/#base-units I'd suggest to change to the base unit, but this isn't my project... ;)

I'm not against changing the scale (using another metric, ofc), but I'm afraid of losing information. Can we guarantee that we won't lose info? Totally ignorant question from my side, but it's just to double-check that Prometheus handles float numbers properly (I don't have any Prometheus server to check it right now)

To elaborate further, it might actually be better to add the WithUnit call (https://pkg.go.dev/go.opentelemetry.io/otel/metric#WithUnit) to the Otel metrics and specify the unit. Either s or ms could be specified (see https://ucum.org/ucum#section-Tables-of-Terminal-Symbols), with the value supplied accordingly.

Sorry for spamming with comments, but I dug a bit deeper in the otel docs, and on https://opentelemetry.io/docs/specs/semconv/general/metrics/#instrument-units, they state (last item in the list):

When instruments are measuring durations, seconds (i.e. s) SHOULD be used.

For clarity, it used to be "keda_triggers_total", but I renamed it in prom to "keda_triggers_handled_total". However, looking at the code, it's an up-down counter, so it's not the historical ever-increasing number of handled triggers. It's the number of triggers which are currently viewed/defined/loaded/registered/observed... from the CRDs.

So "handled" is a bad choice, my mistake. Just naming it "keda_triggers_total" is ambiguous (imho), as it is unclear what exactly about the trigger is being counted (invocations, definitions, registrations...). Exactly the same thoughts about "keda_resources_total" btw.

As this whole PR is biased by myself already, I suggest switching to the naming of triggers_registered_total and resources_registered_total for prometheus, and to triggers_registered_count and resources_registered_count in Otel. It clearly indicates that the system read, validated and started to use the trigger/CRD. I'll prep the PR in that direction (d72b860).
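
To illustrate the distinction drawn above, a minimal sketch of an up-down counter. The type registeredTriggers is a hypothetical name, not KEDA code. The value reflects what is registered right now and can go down again, unlike an ever-increasing count of handled triggers.

```go
package main

import "fmt"

// registeredTriggers models an up-down counter: it tracks how many triggers
// are registered at this moment, not how many were ever handled.
type registeredTriggers struct {
	current int
}

// Inc is called when a trigger is registered.
func (r *registeredTriggers) Inc() { r.current++ }

// Dec is called when a trigger is removed again.
func (r *registeredTriggers) Dec() { r.current-- }

func main() {
	var r registeredTriggers
	r.Inc() // first trigger registered
	r.Inc() // second trigger registered
	r.Dec() // one trigger removed
	fmt.Println(r.current) // 1
}
```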

@zroubalik @tomkerkhove WDYT?

kind reminder @zroubalik @tomkerkhove

OMG, sorry, I totally missed this convo.

+1 from me for registered

JorTurFer · 2023-11-24T13:20:11Z

pkg/metricscollector/prommetrics.go

+		prometheus.GaugeOpts{
+			Namespace: DefaultPromMetricsNamespace,
+			Subsystem: "internal_scale_loop",
+			Name:      "latency_seconds",


same as above

I guess it's been settled that KEDA will use Seconds in metrics.

wozniakjan

lgtm

(although it will need an approval either from @zroubalik or @JorTurFer)

wozniakjan · 2024-01-10T15:55:17Z

pkg/metricscollector/opentelemetry.go

-func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value float64) {
-	otelScalerMetricsLatencyVal.val = value
+func (o *OtelMetrics) RecordScalerLatency(namespace string, scaledObject string, scaler string, scalerIndex int, metric string, value time.Duration) {
+	otelScalerMetricsLatencyVal.val = value.Seconds()


I guess it's been settled that KEDA will use Seconds in opentelemetry metrics.

wozniakjan · 2024-01-10T15:55:49Z

pkg/metricscollector/opentelemetry.go

@@ -210,7 +211,7 @@ func (o *OtelMetrics) RecordScalableObjectLatency(namespace string, name string,
 		attribute.Key("type").String(resourceType),
 		attribute.Key("name").String(name))

-	otelInternalLoopLatencyVal.val = value
+	otelInternalLoopLatencyVal.val = value.Seconds()


I guess it's been settled that KEDA will use Seconds in opentelemetry metrics.

wozniakjan · 2024-01-10T16:06:15Z

pkg/metricscollector/prommetrics.go

+		prometheus.GaugeOpts{
+			Namespace: DefaultPromMetricsNamespace,
+			Subsystem: "internal_scale_loop",
+			Name:      "latency_seconds",


I guess it's been settled that KEDA will use Seconds in metrics.

wozniakjan · 2024-01-10T16:13:19Z

pkg/metricscollector/prommetrics.go

@@ -35,16 +36,16 @@ var (
 		prometheus.GaugeOpts{
 			Namespace: DefaultPromMetricsNamespace,
 			Name:      "build_info",
-			Help:      "A metric with a constant '1' value labeled by version, git_commit and goversion from which KEDA was built.",
+			Help:      "Info metric, with static information about KEDA build like: version, git commit and Golang runtime info.",


nit: was the comma intentional?

Suggested change

-	Help: "Info metric, with static information about KEDA build like: version, git commit and Golang runtime info.",
+	Help: "Info metric with static information about KEDA build like: version, git commit and Golang runtime info.",

wozniakjan · 2024-01-11T14:06:59Z

hey @wonko, would you be available to address the merge conflicts? I can see these files are now unable to merge.

        both modified:   CHANGELOG.md
        both modified:   pkg/metricscollector/metricscollectors.go
        both modified:   pkg/metricscollector/opentelemetry.go
        both modified:   pkg/metricscollector/prommetrics.go
        both modified:   pkg/scaling/scale_handler.go
        both modified:   tests/sequential/opentelemetry_metrics/opentelemetry_metrics_test.go
        both modified:   tests/sequential/prometheus_metrics/prometheus_metrics_test.go

wozniakjan · 2024-01-30T11:30:28Z

hey @wonko, do you think you will be able to find some time to resolve the conflicts? It's ok if not, I can help out to cherry-pick your work and follow up in a new PR.

wonko · 2024-01-30T13:22:54Z

@wozniakjan Been busy with some other stuff lately, but I could pick this up either this week, or this weekend latest.

wozniakjan · 2024-01-30T14:06:42Z

no pressure @wonko, I'm available at your convenience. I just think this is really solid work and it would be great to get this merged in the upcoming months :)

wonko · 2024-02-05T14:30:08Z

no pressure @wonko, I'm available at your convenience. I just think this is really solid work and it would be great to get this merged in the upcoming months :)

@wozniakjan I think it's good now ... please have a good look at it, as I did the merge with a lot of interruptions ...

tests/sequential/prometheus_metrics/prometheus_metrics_test.go

zroubalik · 2024-02-12T22:05:36Z

/run-e2e sequential
Update: You can check the progress here

wozniakjan · 2024-02-21T09:56:20Z

hey @wonko, the e2e tests seem to fail with metrics registration error

2024/02/12 22:24:44 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
panic: a previously registered descriptor with the same fully-qualified name as Desc{fqName: "keda_scaler_errors_total", help: "The total number of errors encountered for each scaler.", constLabels: {}, variableLabels: {namespace,metric,scaledObject,scaler,triggerIndex,type}} has different label names or a different help string

goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0x172ccd8?, {0xc000084e80?, 0x1, 0xc000c9f5d0?})
    /workspace/vendor/github.com/prometheus/client_golang/prometheus/registry.go:405 +0x78
github.com/kedacore/keda/v2/pkg/metricscollector.NewPromMetrics()
    /workspace/pkg/metricscollector/prommetrics.go:233 +0x30e
github.com/kedacore/keda/v2/pkg/metricscollector.NewMetricsCollectors(0xd?, 0x1)
    /workspace/pkg/metricscollector/metricscollectors.go:79 +0x25
main.main()
    /workspace/cmd/operator/main.go:148 +0xb3f

can you please take a look at it? The error location could mean it's related to the changes.
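
For readers unfamiliar with this class of error, here is a simplified stdlib-only model of the check the panic message describes: two descriptors sharing a fully-qualified name must agree on help text and label names. This is not the client_golang implementation. The desc and registry types are hypothetical, and the second descriptor's help text and labels are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// desc is a simplified, hypothetical model of a metric descriptor.
type desc struct {
	fqName string
	help   string
	labels []string
}

// signature captures what must match for descriptors that share a name:
// the help string and the (order-independent) set of label names.
func (d desc) signature() string {
	l := append([]string(nil), d.labels...)
	sort.Strings(l)
	return d.help + "|" + strings.Join(l, ",")
}

// registry remembers the signature seen for each fully-qualified name.
type registry struct {
	seen map[string]string
}

// register rejects a descriptor whose name is already known with a
// different help string or different label names.
func (r *registry) register(d desc) error {
	if r.seen == nil {
		r.seen = map[string]string{}
	}
	if sig, ok := r.seen[d.fqName]; ok && sig != d.signature() {
		return fmt.Errorf("descriptor %q already registered with different label names or help string", d.fqName)
	}
	r.seen[d.fqName] = d.signature()
	return nil
}

func main() {
	var r registry

	// Name, help text and labels taken from the panic message above.
	first := desc{
		fqName: "keda_scaler_errors_total",
		help:   "The total number of errors encountered for each scaler.",
		labels: []string{"namespace", "metric", "scaledObject", "scaler", "triggerIndex", "type"},
	}
	// Hypothetical second definition: same name, different help and labels.
	second := desc{
		fqName: "keda_scaler_errors_total",
		help:   "Some other help text.",
		labels: []string{"namespace"},
	}

	fmt.Println(r.register(first))  // <nil>
	fmt.Println(r.register(second)) // conflict error
}
```

In this model, a collision like the one in the log would arise if a renamed metric ends up with the same fully-qualified name as an existing one while carrying different help text or labels.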

zroubalik

@wozniakjan could you please check what the problem is and help us complete this PR in case @wonko doesn't have capacity (which seems to be the case)?
I would like to get this merged in 2.14 🙏

wozniakjan · 2024-04-10T17:17:31Z

Sure thing, I can investigate tomorrow.

wonko requested a review from a team as a code owner November 9, 2023 16:21

wonko mentioned this pull request Nov 9, 2023

Updates docs to match PR #5174 kedacore/keda-docs#1258

Closed

1 task

JorTurFer reviewed Nov 21, 2023

zroubalik reviewed Nov 22, 2023

wonko added 8 commits November 23, 2023 10:27

Reworks the prometheus metrics to adhere to best practices

a08cc7c

Signed-off-by: Bernard Grymonpon <[email protected]> Signed-off-by: Bernard Grymonpon <[email protected]>

Updates help info to align with the public docs

4521af3

Signed-off-by: Bernard Grymonpon <[email protected]>

Updates grafana dashboard to use the new metrics

a0e5ab8

Signed-off-by: Bernard Grymonpon <[email protected]>

resolved review comments

571419b

Signed-off-by: Bernard Grymonpon <[email protected]>

Updates webhook metrics and e2e tests

c557a34

Signed-off-by: Bernard Grymonpon <[email protected]>

newline at the end of json :thinking_face:

2fae5f3

Signed-off-by: Bernard Grymonpon <[email protected]>

correct tests

7158f7f

Signed-off-by: Bernard Grymonpon <[email protected]>

another go at the newline issue

49aaff2

Signed-off-by: Bernard Grymonpon <[email protected]>

wonko force-pushed the feature/rework-prometheus-metric-names branch from cf81afc to 49aaff2 Compare November 23, 2023 09:27

JorTurFer reviewed Nov 24, 2023

wonko added 6 commits November 25, 2023 09:54

Reworked otel metrics to align with best practices

623d4c7

Signed-off-by: Bernard Grymonpon <[email protected]>

updated E2E tests for Otel

13ca4a5

Signed-off-by: Bernard Grymonpon <[email protected]>

Handled -> Registered

d72b860

Signed-off-by: Bernard Grymonpon <[email protected]>

align namespace naming to not be pluralized

fd45826

Signed-off-by: Bernard Grymonpon <[email protected]>

rewrite the metric names in the tests

f97a650

Signed-off-by: Bernard Grymonpon <[email protected]>

even more rewriting ... 🤦

36ca55a

Signed-off-by: Bernard Grymonpon <[email protected]>

JorTurFer mentioned this pull request Dec 18, 2023

Expose prometheus metrics for ScaledJob resources #4913

Merged

4 tasks

wozniakjan approved these changes Jan 10, 2024

wozniakjan reviewed Jan 10, 2024

wonko added 3 commits February 5, 2024 10:48

Merge branch 'main' into feature/rework-prometheus-metric-names

8c97d52

Tuning merge

3a85292

Signed-off-by: Bernard Grymonpon <[email protected]>

fixing tests

388be0f

Signed-off-by: Bernard Grymonpon <[email protected]>

wonko force-pushed the feature/rework-prometheus-metric-names branch from 4778210 to 388be0f Compare February 5, 2024 14:12

wozniakjan reviewed Feb 11, 2024

zroubalik reviewed Apr 10, 2024

This was referenced Apr 12, 2024

Reworks the prometheus metrics to adhere to best practices #5687

Merged

Release: 2.14 #5671

Closed

Document prometheus metric deprecations kedacore/keda-docs#1374

Merged

zroubalik closed this Apr 19, 2024
