Optimize build cluster performance #3890
Starting to look at co-located jobs during flakes.

Flake: https://prow.istio.io/view/gs/istio-prow/logs/integ-k8s-120_istio_postsubmit/1498955294602956800

Mysterious 141 error. OOM? At 2022-03-02T09:40:43.788454Z. Co-located with integ-pilot (started at the same time) and integ-assertion (deep into its run). Total memory used by all 3 is only 12GB, not very concerning.

TestReachability/global-plaintext/b_in_primary/tcp_to_headless:tcp_positive failure at 2022-03-02T04:10:57.231356Z. Co-located with integ-k8s-119, which started at the same time. It was using near zero CPU at the time of the test failure - it was literally doing nothing (a bug of its own).
https://prow.istio.io/view/gs/istio-prow/logs/integ-security-multicluster_istio_postsubmit/1498793275824279552 - double failure!

TestReachability/beta-mtls-permissive/b_in_primary/tcp_to_b:tcp_positive at 2022-03-01T23:10:08.013392Z. Co-scheduled with a distroless job that started after it. The distroless job peaks in CPU from 23:00 but is done by 23:05 - well before the failures. Also co-scheduled with a helm test, which runs from 23:10 to 23:16, so it really shouldn't be overlapping with either of the failures - it is close, though.

So in the 3 cases I looked at, co-scheduling doesn't seem to be related.
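This kind of co-scheduling check could be scripted rather than pieced together by hand. Below is a rough sketch, not anything from this issue: the flag names are made up, it assumes kubeconfig access to the build cluster, and it simply lists pods on the same node that had already started before a given failure time.

```go
// cosched.go: list pods that were running on the same node as a flaky job
// around the time of a failure, to check for noisy neighbors.
// Hedged sketch only; flags and the start-time filter are illustrative assumptions.
package main

import (
	"context"
	"flag"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to kubeconfig for the build cluster")
	node := flag.String("node", "", "node the flaky job ran on")
	failedAt := flag.String("failed-at", "", "failure timestamp, RFC3339 (e.g. 2022-03-02T04:10:57Z)")
	flag.Parse()

	failTime, err := time.Parse(time.RFC3339, *failedAt)
	if err != nil {
		panic(err)
	}

	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Only pods scheduled onto the same node are interesting.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + *node,
	})
	if err != nil {
		panic(err)
	}

	for _, p := range pods.Items {
		if p.Status.StartTime == nil {
			continue
		}
		started := p.Status.StartTime.Time
		// Rough filter: pods that started before the failure are potential
		// co-scheduling suspects; ones that started after are not.
		if started.After(failTime) {
			continue
		}
		fmt.Printf("%s/%s started %s (%s before failure), phase=%s\n",
			p.Namespace, p.Name, started.Format(time.RFC3339),
			failTime.Sub(started).Round(time.Second), p.Status.Phase)
	}
}
```

Note this only sees pods still known to the API server; fully historical correlation would still need the job logs and monitoring data used above.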
This issue aims to track all things related to optimizing our build cluster performance.
We have done a lot of work to reduce test flakes, but we still see them relatively often. In a large number of cases, these appear to occur when things that should always succeed fail for reasons outside of poorly written tests or buggy Istio code. For example, simple HTTP requests timing out after many seconds.
We have had two similar issues in the past:
`echo` takes over 60s in some cases. Triggered by a node upgrade in GKE. We switched from Ubuntu to COS to mitigate this. Root cause unknown to date.

Current state:
We measure how long a simple `echo` takes. On a healthy machine, this should, of course, take near 0ms. We often see this spike, correlated with increased CPU usage.

The top graph shows latency grouped by node type, the bottom shows all nodes. You can see spikes up to 2.5s. Note: the node type graph is likely misleading; we have a small fixed number of n2/t2d nodes but a large dynamic number of e2 nodes. This means there are more samples for e2 AND it has more cache misses.
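The probe itself isn't shown in this issue; as a rough sketch of what such a measurement could look like (assuming it is simply timing a fork/exec of `echo` on the node in a loop, with the export-to-dashboards plumbing omitted):

```go
// echoprobe.go: time how long it takes to fork/exec a trivial `echo` command.
// A sketch of the kind of probe described above; metric export, per-node
// labels, etc. are assumed and not shown.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	const interval = 10 * time.Second
	for {
		start := time.Now()
		// On a healthy machine this should take on the order of a few ms.
		if err := exec.Command("echo", "hello").Run(); err != nil {
			fmt.Println("echo failed:", err)
		}
		elapsed := time.Since(start)
		// Spikes into the hundreds of ms (or seconds) point at node-level
		// contention rather than anything test-specific.
		fmt.Printf("%s echo took %s\n", time.Now().Format(time.RFC3339), elapsed)
		time.Sleep(interval)
	}
}
```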
Things to try: