[Bug]: Prometheus rule KubeletTooManyPods incorrect statistics #997

Open
4 tasks done
jeffaryhe opened this issue Dec 13, 2024 · 1 comment

@jeffaryhe

What happened?

Please take a look at prometheus-operator/kube-prometheus#2558.

Please provide any helpful snippets.

No response

What parts of the codebase are affected?

Rules

I agree to the following terms:

  • I agree to follow this project's Code of Conduct.
  • I have filled out all the required information above to the best of my ability.
  • I have searched the issues of this repository and believe that this is not a duplicate.
  • I have confirmed this bug exists in the default branch of the repository, as of the latest commit at the time of submission.
@skl
Collaborator

skl commented Dec 16, 2024

Hi @jeffaryhe, thanks for the report. I had a look at the issue you raised, prometheus-operator/kube-prometheus#2558.

I also had a look at the KubeletTooManyPods alert rule:

expr: |||
  count by(%(clusterLabel)s, node) (
    (kube_pod_status_phase{%(kubeStateMetricsSelector)s,phase="Running"} == 1)
    * on(instance,pod,namespace,%(clusterLabel)s) group_left(node)
    topk by(instance,pod,namespace,%(clusterLabel)s) (1, kube_pod_info{%(kubeStateMetricsSelector)s})
  )
  /
  max by(%(clusterLabel)s, node) (
    kube_node_status_capacity{%(kubeStateMetricsSelector)s,resource="pods"} != 1
  ) > 0.95
||| % $._config,
'for': '15m',

From what I can see, the alert fires when the number of running pods on a node stays above 95% of that node's pod capacity for 15 minutes.
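To make that reading concrete, here is a minimal Python sketch of the same arithmetic. It is not the rule itself, only an approximation under stated assumptions: the function name `nodes_over_threshold`, the input shapes, and the sample node names are hypothetical, and the `for: 15m` hold period is not modelled.

```python
from collections import Counter

THRESHOLD = 0.95  # taken from the "> 0.95" comparison in the rule


def nodes_over_threshold(running_pods, capacity, threshold=THRESHOLD):
    """Return {node: ratio} for nodes whose running-pod ratio exceeds threshold.

    running_pods: iterable of (namespace, pod, node) tuples for pods whose
                  phase is Running (hypothetical input shape).
    capacity:     dict mapping node name -> pod capacity.
    """
    # Keep one node per (namespace, pod), loosely mirroring the
    # topk(1, kube_pod_info) de-duplication in the rule.
    pod_to_node = {}
    for namespace, pod, node in running_pods:
        pod_to_node.setdefault((namespace, pod), node)

    counts = Counter(pod_to_node.values())

    over = {}
    for node, count in counts.items():
        cap = capacity.get(node)
        # The rule drops capacity samples equal to 1. This sketch also skips
        # missing or zero capacity to avoid dividing by zero (an assumption;
        # PromQL would produce +Inf for a zero denominator).
        if not cap or cap == 1:
            continue
        ratio = count / cap
        if ratio > threshold:
            over[node] = ratio
    return over


# Usage: node-a runs 106 of 110 pods, node-b runs 50 of 110.
pods = [("default", f"a-{i}", "node-a") for i in range(106)]
pods += [("default", f"b-{i}", "node-b") for i in range(50)]
flagged = nodes_over_threshold(pods, {"node-a": 110, "node-b": 110})
```

With these sample inputs only `node-a` is flagged, since 106/110 is above 0.95 and 50/110 is not.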

Can you help me understand which statistics you see as incorrect? For example, do you think part of the alert rule could be improved?

skl self-assigned this on Dec 16, 2024
skl added the question (Further information is requested) label on Dec 17, 2024