Metrics: workflow_status_phase (which includes workflow name label) and workflow_start_time #13683

Open
tooptoop4 opened this issue Sep 30, 2024 · 11 comments

@tooptoop4
Contributor

In Prometheus, pods have metrics such as kube_pod_status_phase and kube_pod_start_time.

We need similar metrics at the workflow level.

@tooptoop4 tooptoop4 added the type/feature Feature request label Sep 30, 2024
@agilgur5 agilgur5 changed the title Prometheus metrics - workflow_status_phase (which includes workflow name label) and workflow_start_time Metrics: workflow_status_phase (which includes workflow name label) and workflow_start_time Sep 30, 2024
@napestershine

@tooptoop4 is someone already working on this, or can I investigate implementing it?

@tooptoop4
Contributor Author

@napestershine you can work on it

@Joibel
Member

Joibel commented Oct 29, 2024

A big problem with adding the workflow name to metrics is that it is very high cardinality: it essentially creates a separate data series for every workflow. All of these data series live in the workflow controller's memory for the lifetime of the controller, and the receiving store also needs to store a separate time series for each one.
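To illustrate the cardinality point, here is a minimal sketch of a toy in-memory metric store. It is not the workflow controller's code; `SeriesStore`, `count_series`, and the 1000 sample workflows are hypothetical names and data made up for this illustration. It only shows that the number of stored series equals the number of distinct label combinations.

```python
from collections import defaultdict


class SeriesStore:
    """Toy metric store (hypothetical): one in-memory series per unique label set."""

    def __init__(self):
        self.series = defaultdict(float)

    def inc(self, metric, labels):
        # Each distinct (metric, labels) combination becomes its own series.
        key = (metric, tuple(sorted(labels.items())))
        self.series[key] += 1


def count_series(label_sets):
    """Record one observation per label set and return how many series exist."""
    store = SeriesStore()
    for labels in label_sets:
        store.inc("workflow_status_phase", labels)
    return len(store.series)


# Made-up data: 1000 workflows spread over 3 namespaces.
workflows = [
    {"namespace": f"ns-{i % 3}", "name": f"wf-{i}", "phase": "Succeeded"}
    for i in range(1000)
]

# Labelled by namespace and phase only: series count is bounded by namespaces.
by_namespace = count_series(
    {"namespace": w["namespace"], "phase": w["phase"]} for w in workflows
)

# Labelled by workflow name as well: one series per workflow.
by_name = count_series(workflows)

print(by_namespace, by_name)  # 3 1000
```

With the workflow name as a label, the series count grows with every workflow ever observed, while the namespace-level labelling stays bounded by the number of namespaces.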

I have already implemented some higher cardinality metrics (around namespaces and workflowTemplateRef names) to help with some of the issues you might be attempting to address, but blindly doing this will not be OK.

The issue description doesn't explain why these metrics are needed per workflow.

I am working on tracing for workflow support which may allow some of the metrics you want to be extracted from the traces.

@napestershine

I might be new to this topic. A simple use case: let's say I have a cronworkflow and I want to check whether it was triggered on its schedule.

@Joibel
Member

Joibel commented Oct 29, 2024

> I might be new to this topic. A simple use case: let's say I have a cronworkflow and I want to check whether it was triggered on its schedule.

This proposal would give you the workflow name from which you'd have to establish the cronworkflow name.

https://argo-workflows.readthedocs.io/en/latest/metrics/#cronworkflows_triggered_total gives you this with much less cardinality.

@tooptoop4
Contributor Author

Surely they could be purged from memory once they have been succeeded/errored/failed for more than 10 minutes.

@napestershine

@Joibel This feature is available in 3.6.x, which has not been officially released yet. When can we expect that release?

@Joibel
Member

Joibel commented Oct 30, 2024

> @Joibel This feature is available in 3.6.x, which has not been officially released yet. When can we expect that release?

The official answer is, as always, "when it's done".

Currently there is an rc3 release out. We need to make an rc4 and then wait 2 weeks. I'd now predict the first half of November, but there aren't any promises.

Please test rc3 and let us know how that works for you.

@Joibel
Member

Joibel commented Oct 30, 2024

> Surely they could be purged from memory once they have been succeeded/errored/failed for more than 10 minutes.

You have to hack the OpenTelemetry code to do this, because it isn't considered the correct way to implement metrics. We already do this for custom metrics. It only solves half of the problem, though: you're still paying heavily for your metrics storage when cardinality is high.
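As a rough sketch of the purge-after-a-timeout idea discussed here, the following toy store drops series that have not been updated within a TTL. This is not the OpenTelemetry SDK and not how Argo Workflows implements it; `ExpiringSeriesStore`, its methods, and the injected fake clock are all hypothetical and exist only for illustration.

```python
import time


class ExpiringSeriesStore:
    """Toy store (hypothetical) that forgets series not updated within ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the example can use a fake clock
        self.values = {}
        self.last_updated = {}

    def set(self, labels, value):
        key = tuple(sorted(labels.items()))
        self.values[key] = value
        self.last_updated[key] = self.clock()

    def purge_expired(self):
        """Delete series older than the TTL and return how many were removed."""
        now = self.clock()
        expired = [k for k, t in self.last_updated.items() if now - t > self.ttl]
        for key in expired:
            del self.values[key]
            del self.last_updated[key]
        return len(expired)


# Fake clock so the example runs instantly; 600 s mirrors the "10 minutes" above.
fake_now = [0.0]
store = ExpiringSeriesStore(ttl_seconds=600, clock=lambda: fake_now[0])

store.set({"name": "wf-a", "phase": "Succeeded"}, 1)  # written at t=0
fake_now[0] = 500.0
store.set({"name": "wf-b", "phase": "Running"}, 1)    # written at t=500
fake_now[0] = 700.0

purged = store.purge_expired()  # wf-a is 700 s old, wf-b only 200 s
print(purged, len(store.values))  # 1 1
```

A purge like this only bounds the memory held by the process exposing the metrics. Every series that was scraped before being purged still exists in the metrics storage backend, which is the second half of the cost described above.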

@tooptoop4
Contributor Author

kube_pod_status_phase is already there with even higher cardinality


4 participants