CPU/memory usage relative to requests/limits (and other useful node and job metrics), provided by the Stackdriver Prow dashboard. I am experimenting with various alerts there; once the alerts are properly tuned, I will push them to Slack as well.
What we do not have:
How much capacity we have left, so we can see if we need to scale our cluster up/down. I think this requires the GCP monitoring agent because I recall this data being unavailable.
It would be useful to have metrics like
A nice-to-have would be seeing this broken down per job.
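To make "capacity left" concrete, here is a minimal sketch of the calculation I have in mind, assuming we can already get each node's allocatable CPU/memory (e.g. `.status.allocatable`) and each pod's resource requests from somewhere. The `Node` and `PodRequest` types, the `job` label, and both function names are hypothetical, made up for illustration; nothing here talks to Stackdriver or the Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int       # allocatable CPU in millicores
    allocatable_mem_bytes: int   # allocatable memory in bytes


@dataclass
class PodRequest:
    node: str        # node the pod is scheduled on
    job: str         # hypothetical job label used for grouping
    cpu_m: int       # summed CPU requests of the pod's containers, millicores
    mem_bytes: int   # summed memory requests of the pod's containers, bytes


def cluster_headroom(nodes, pods):
    """Return unrequested CPU/memory and the fraction of allocatable already requested."""
    total_cpu = sum(n.allocatable_cpu_m for n in nodes)
    total_mem = sum(n.allocatable_mem_bytes for n in nodes)
    req_cpu = sum(p.cpu_m for p in pods)
    req_mem = sum(p.mem_bytes for p in pods)
    return {
        "cpu_free_m": total_cpu - req_cpu,
        "mem_free_bytes": total_mem - req_mem,
        "cpu_requested_fraction": req_cpu / total_cpu if total_cpu else 0.0,
        "mem_requested_fraction": req_mem / total_mem if total_mem else 0.0,
    }


def requests_by_job(pods):
    """Sum CPU and memory requests per job label."""
    totals = {}
    for p in pods:
        cpu, mem = totals.get(p.job, (0, 0))
        totals[p.job] = (cpu + p.cpu_m, mem + p.mem_bytes)
    return totals
```

A requested fraction that stays close to 1.0 would suggest scaling up, and one that stays low would suggest scaling down; the actual thresholds would need tuning against our workloads.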
Looking at Stackdriver, it seems we can get the GCE stats of the underlying nodes, but I don't see the Kubernetes metrics there.
It's possible I am just looking in the wrong place and we have these already.