Optimize network metrics collection #3302
Conversation
Hi @kolyshkin. Thanks for your PR. I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
Force-pushed from 493a5b4 to 575c030
@bobbypage PTAL. Also, is there anything I can do to skip waiting for ok-to-test? The bot says I need to join the kubernetes org, but I am already a member.
Thanks @kolyshkin for the investigation here and perf fixes. Will take a closer look!
Hmm, I'm not sure; I was under the assumption that all k8s members could skip the ok-to-test, but there may be some differences for non-k8s repos...
container/containerd/handler.go
Outdated
if !metrics.HasAny(container.AllNetworkMetrics) {
	return metrics
}
Is there any reason we need to first check whether the network metrics are included, and only if true, call metrics.Difference? Shouldn't it be safe to always call metrics.Difference (if the network metrics are not included, it should be a no-op)? Or is this check more of a performance optimization to avoid creating a new MetricSet?
Yes, this is a performance optimization -- why bother with copying if we don't have to? The downside is that the code looks a bit less compact than it could be -- but it's still pretty readable.
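To make the trade-off concrete, here is a self-contained sketch of the pattern under discussion: check HasAny first, and only allocate a new set via Difference when there is actually something to remove. The types and names below are illustrative stand-ins, not cAdvisor's actual container.MetricSet API.

```go
package main

import "fmt"

// Illustrative stand-ins for cAdvisor's MetricKind/MetricSet; the real types
// live in github.com/google/cadvisor/container, and these are assumptions.
type MetricKind string

type MetricSet map[MetricKind]struct{}

func (ms MetricSet) Has(mk MetricKind) bool {
	_, ok := ms[mk]
	return ok
}

// HasAny reports whether ms contains at least one of the metrics in other.
func (ms MetricSet) HasAny(other MetricSet) bool {
	for mk := range other {
		if ms.Has(mk) {
			return true
		}
	}
	return false
}

// Difference returns a new set: a copy of ms without the metrics in other.
func (ms MetricSet) Difference(other MetricSet) MetricSet {
	result := MetricSet{}
	for mk := range ms {
		if !other.Has(mk) {
			result[mk] = struct{}{}
		}
	}
	return result
}

var allNetworkMetrics = MetricSet{"network": {}, "tcp": {}, "udp": {}}

// removeNetMetrics shows the pattern from the review thread: return the input
// untouched (no copy, no allocation) unless network metrics are both present
// and requested to be removed.
func removeNetMetrics(metrics MetricSet, remove bool) MetricSet {
	if !remove || !metrics.HasAny(allNetworkMetrics) {
		return metrics // fast path: nothing to strip, so no new MetricSet
	}
	return metrics.Difference(allNetworkMetrics)
}

func main() {
	ms := MetricSet{"cpu": {}, "network": {}}
	fmt.Println(removeNetMetrics(ms, true))  // map[cpu:{}]
	fmt.Println(removeNetMetrics(ms, false)) // map[cpu:{} network:{}]
}
```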
@@ -217,7 +216,6 @@ func newDockerContainerHandler(
 		Namespace: DockerNamespace,
 	}
 	handler.image = ctnr.Config.Image
-	handler.networkMode = ctnr.HostConfig.NetworkMode
why was this removed?
This is because the only user of h.networkMode was func (h *dockerContainerHandler) needNet() bool. That method is now gone, and so the field is unused.
A couple of small questions, but the changes largely LGTM. Has this been tested locally? It would be nice to test this end to end by integrating it into k8s and ensuring that the network metrics are still present; we have an e2e test for this in k8s. One idea: can you maybe open a WIP PR to k/k that vendors this copy of cAdvisor (with the changes in this PR) and then use that WIP PR to run the pull job in k/k to verify network metrics are still working correctly? You should be able to vendor this into the k/k PR via:
It's actually checking if you're a member of the organization the PR is opened against (so github.com/google), but the join link is mis-configured to Kubernetes's. The first part of the bot message is correct:
But the join link is wrongly configured: https://github.com/kubernetes/test-infra/blob/62ef55338ead7466542e0c20d226c695f97ac3f4/config/prow/plugins.yaml#L21

Prow had support for setting a trusted organization, but that's been long deprecated in favor of "is a member of the organization that owns the repo || is a collaborator on the repo" (which can be restricted to only org members by config).

cAdvisor as a Google project should stop using Kubernetes's instance and use the Google OSS instance anyhow (#3116), now that we have distinct instances and dedicated funding for Kubernetes (well, for years now ... it's one of the few projects lagging on switching).
Could you provide tests for RemoveNetMetrics?
Force-pushed from 575c030 to 177b129
OK, I did a rewrite to avoid code repetition, added unit tests for RemoveNetMetrics (as requested by @iwankgb), improved some comments, and hopefully made the whole thing more readable. kubernetes/kubernetes#117833 is also updated.
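For readers who want a feel for the shape of such a test, here is a rough sketch of a unit test for RemoveNetMetrics (not the test that was actually added), assuming the signature visible in the handler diffs, common.RemoveNetMetrics(metrics container.MetricSet, remove bool), and cAdvisor's existing MetricSet/MetricKind types:

```go
package common_test

import (
	"testing"

	"github.com/google/cadvisor/container"
	"github.com/google/cadvisor/container/common"
)

// Sketch only: assumes RemoveNetMetrics(metrics container.MetricSet, remove bool)
// as used in the handler diffs; the real test in the PR may differ.
func TestRemoveNetMetrics(t *testing.T) {
	metrics := container.MetricSet{
		container.CpuUsageMetrics:     struct{}{},
		container.MemoryUsageMetrics:  struct{}{},
		container.NetworkUsageMetrics: struct{}{},
	}

	// With remove=false the input should come back unchanged.
	got := common.RemoveNetMetrics(metrics, false)
	if !got.Has(container.NetworkUsageMetrics) {
		t.Error("network metrics should be kept when remove is false")
	}

	// With remove=true the network metrics should be dropped, the rest kept.
	got = common.RemoveNetMetrics(metrics, true)
	if got.Has(container.NetworkUsageMetrics) {
		t.Error("network metrics should be removed when remove is true")
	}
	if !got.Has(container.CpuUsageMetrics) || !got.Has(container.MemoryUsageMetrics) {
		t.Error("non-network metrics should be kept")
	}
}
```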
@bobbypage if you don't have more comments, then I'm going to merge it tomorrow.
Apparently, we collect network stats for *all* containers, and then discard most (or some) of these statistics:

- for both crio and containerd, we collect and discard stats for non-infra containers (i.e. most containers);
- for docker, we collect and discard stats for containers which share a netns with another container (which is rare, I guess).

Instead of reading and parsing a bunch of files in /proc/PID/net and then discarding the just-gathered stats, let's set things up in a way so we don't collect useless stats in the first place. This should improve performance, memory usage, and ease the load on garbage collection.

Signed-off-by: Kir Kolyshkin <[email protected]>
Force-pushed from 177b129 to eac1257
Whoops, fixed the logic (that was inverted for containerd and crio). 😊

diff --git a/container/containerd/handler.go b/container/containerd/handler.go
index 626d4fd3..57a2d82c 100644
--- a/container/containerd/handler.go
+++ b/container/containerd/handler.go
@@ -131,7 +131,7 @@ func newContainerdContainerHandler(
 	// infrastructure container -- does not need their stats to be
 	// reported. This stops metrics being reported multiple times for each
 	// container in a pod.
-	metrics := common.RemoveNetMetrics(includedMetrics, cntr.Labels["io.cri-containerd.kind"] == "sandbox")
+	metrics := common.RemoveNetMetrics(includedMetrics, cntr.Labels["io.cri-containerd.kind"] != "sandbox")
 	libcontainerHandler := containerlibcontainer.NewHandler(cgroupManager, rootfs, int(taskPid), metrics)
diff --git a/container/crio/handler.go b/container/crio/handler.go
index e945b790..a6832518 100644
--- a/container/crio/handler.go
+++ b/container/crio/handler.go
@@ -154,7 +154,7 @@ func newCrioContainerHandler(
 	// infrastructure container -- does not need their stats to be
 	// reported. This stops metrics being reported multiple times for each
 	// container in a pod.
-	metrics := common.RemoveNetMetrics(includedMetrics, cInfo.Labels["io.kubernetes.container.name"] == "POD")
+	metrics := common.RemoveNetMetrics(includedMetrics, cInfo.Labels["io.kubernetes.container.name"] != "POD")
 	libcontainerHandler := containerlibcontainer.NewHandler(cgroupManager, rootFs, cInfo.Pid, metrics)
This is just to run e2e tests with cAdvisor from google/cadvisor#3302.

Signed-off-by: Kir Kolyshkin <[email protected]>
OTOH, this gave us a way to check whether the test mentioned in #3302 (comment) is working. Indeed it is! From https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/117833/pull-kubernetes-node-e2e-containerd/1655703979042017280/:
With the change from #3302 (comment) the e2e tests are now passing in kubernetes/kubernetes#117833 (this time I ran crio-e2e as well).
all updates LGTM! thank you for testing in k/k and glad to see it was helpful and caught a potential issue! :)
@iwankgb please take a final look and merge if it's all ok from your side. LGTM from me.
I actually found the bug while re-reviewing this PR first, and then took a look at the test results. In retrospect, it's good that we validated that the e2e test works.
Here's a follow-up to this we can make for cri-o: #3592
Apparently, we collect network stats for all containers, and then discard some or most of them:

- for docker, we collect and discard stats for containers which share a netns with another container (which is rare, I guess; see the sketch after this description);
- for both crio and containerd, we collect and discard stats for containers that are not infra (sandbox, pod, pause) containers (which is very common).
Instead of reading and parsing a bunch of files in /proc/PID/net and then throwing those stats away, let's set things up in a way so we don't collect useless stats in the first place.
This should improve performance, memory usage, and ease the load on garbage collection.
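For the Docker case above, the decision essentially comes down to whether the container joins another container's network namespace. A hypothetical sketch of that check (removeNet is an illustrative name, not the handler's actual code; the real wiring lives in container/docker/handler.go):

```go
package main

import (
	"fmt"

	dockercontainer "github.com/docker/docker/api/types/container"
)

// removeNet (illustrative name) reports whether network metrics should be
// dropped for a Docker container: when it joins another container's netns
// ("container:<id>"), the traffic is already accounted to that other
// container, so collecting it again here would only duplicate the stats.
func removeNet(mode dockercontainer.NetworkMode) bool {
	return mode.IsContainer()
}

func main() {
	fmt.Println(removeNet("bridge"))           // false: keep network metrics
	fmt.Println(removeNet("container:abc123")) // true: drop network metrics
}
```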