Optimize network metrics collection #3302
Conversation
Hi @kolyshkin. Thanks for your PR. I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
Force-pushed from 493a5b4 to 575c030
@bobbypage PTAL. Also, is there anything I can do to skip waiting for ok-to-test? The bot says I need to join the kubernetes org, but I am already a member.
Thanks @kolyshkin for the investigation here and perf fixes. Will take a closer look!
Hmm, I'm not sure; I was under the assumption that all k8s members could skip the ok-to-test, but there may be some differences for non-k8s repos...
container/containerd/handler.go
Outdated
if !metrics.HasAny(container.AllNetworkMetrics) {
	return metrics
}
Is there any reason we need to first check whether the network metrics are included, and only if true, call metrics.Difference? Shouldn't it be safe to always call metrics.Difference (if the network metrics are not included, it should be a no-op)? Or is this check more of a performance optimization to avoid creating a new MetricSet?
Yes, this is a performance optimization -- why bother with copying if we don't have to? The downside is that the code looks a bit less compact than it could be -- but it's still pretty readable.
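To make the trade-off concrete, here is a self-contained sketch of the pattern under discussion: check HasAny first, and only allocate a new set via Difference when there is actually something to remove. The types and names below are illustrative stand-ins, not cAdvisor's actual container.MetricSet API.

```go
package main

import "fmt"

// Illustrative stand-ins for cAdvisor's MetricKind/MetricSet; the real types
// live in github.com/google/cadvisor/container, and these are assumptions.
type MetricKind string

type MetricSet map[MetricKind]struct{}

func (ms MetricSet) Has(mk MetricKind) bool {
	_, ok := ms[mk]
	return ok
}

// HasAny reports whether ms contains at least one of the metrics in other.
func (ms MetricSet) HasAny(other MetricSet) bool {
	for mk := range other {
		if ms.Has(mk) {
			return true
		}
	}
	return false
}

// Difference returns a new set: a copy of ms without the metrics in other.
func (ms MetricSet) Difference(other MetricSet) MetricSet {
	result := MetricSet{}
	for mk := range ms {
		if !other.Has(mk) {
			result[mk] = struct{}{}
		}
	}
	return result
}

var allNetworkMetrics = MetricSet{"network": {}, "tcp": {}, "udp": {}}

// removeNetMetrics shows the pattern from the review thread: return the input
// untouched (no copy, no allocation) unless network metrics are both present
// and requested to be removed.
func removeNetMetrics(metrics MetricSet, remove bool) MetricSet {
	if !remove || !metrics.HasAny(allNetworkMetrics) {
		return metrics // fast path: nothing to strip, so no new MetricSet
	}
	return metrics.Difference(allNetworkMetrics)
}

func main() {
	ms := MetricSet{"cpu": {}, "network": {}}
	fmt.Println(removeNetMetrics(ms, true))  // map[cpu:{}]
	fmt.Println(removeNetMetrics(ms, false)) // map[cpu:{} network:{}]
}
```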
@@ -217,7 +216,6 @@ func newDockerContainerHandler(
 		Namespace: DockerNamespace,
 	}
 	handler.image = ctnr.Config.Image
-	handler.networkMode = ctnr.HostConfig.NetworkMode
why was this removed?
This is because the only user of h.networkMode was func (h *dockerContainerHandler) needNet() bool. That method is now gone, and so the field is unused.
A couple of small questions, but the changes largely LGTM. Has this been tested locally? It would be nice to test this end to end by integrating it into k8s and ensuring that the network metrics are still present; we have an e2e test for this in k8s. One idea: can you maybe open a WIP PR to k/k that vendors this copy of cAdvisor (with the changes in this PR) and then use that WIP PR to run the pull job in k/k to verify network metrics are still working correctly? You should be able to vendor this into the k/k PR via:
It's actually checking if you're a member of the organization the PR is opened against (so github.com/google), but the join link is mis-configured to Kubernetes's. The first part of the bot message is correct:
But the join link is wrongly configured: https://github.com/kubernetes/test-infra/blob/62ef55338ead7466542e0c20d226c695f97ac3f4/config/prow/plugins.yaml#L21

Prow had support for setting a trusted organization, but that's been long deprecated in favor of "is a member of the organization that owns the repo || is a collaborator on the repo" (which can be restricted to only org members by config).

cAdvisor as a Google project should stop using Kubernetes's instance and use the Google OSS instance anyhow (#3116), now that we have distinct instances and dedicated funding for Kubernetes (well, for years now ... it's one of the few projects lagging on switching).
Could you provide tests for RemoveNetMetrics?
Force-pushed from 575c030 to 177b129
OK, I did a rewrite to avoid code repetition, added unit tests for RemoveNetMetrics (as requested by @iwankgb), improved some comments, and hopefully made the whole thing more readable. kubernetes/kubernetes#117833 is also updated.
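For readers who want a feel for the shape of such a test, here is a rough sketch of a unit test for RemoveNetMetrics (not the test that was actually added), assuming the signature visible in the handler diffs, common.RemoveNetMetrics(metrics container.MetricSet, remove bool), and cAdvisor's existing MetricSet/MetricKind types:

```go
package common_test

import (
	"testing"

	"github.com/google/cadvisor/container"
	"github.com/google/cadvisor/container/common"
)

// Sketch only: assumes RemoveNetMetrics(metrics container.MetricSet, remove bool)
// as used in the handler diffs; the real test in the PR may differ.
func TestRemoveNetMetrics(t *testing.T) {
	metrics := container.MetricSet{
		container.CpuUsageMetrics:     struct{}{},
		container.MemoryUsageMetrics:  struct{}{},
		container.NetworkUsageMetrics: struct{}{},
	}

	// With remove=false the input should come back unchanged.
	got := common.RemoveNetMetrics(metrics, false)
	if !got.Has(container.NetworkUsageMetrics) {
		t.Error("network metrics should be kept when remove is false")
	}

	// With remove=true the network metrics should be dropped, the rest kept.
	got = common.RemoveNetMetrics(metrics, true)
	if got.Has(container.NetworkUsageMetrics) {
		t.Error("network metrics should be removed when remove is true")
	}
	if !got.Has(container.CpuUsageMetrics) || !got.Has(container.MemoryUsageMetrics) {
		t.Error("non-network metrics should be kept")
	}
}
```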
@bobbypage if you don't have more comments, then I'm going to merge it tomorrow.
Apparently, we collect network stats for *all* containers, and then discard most (or some) of these statistics:

- for both crio and containerd, we collect and discard stats for non-infra containers (i.e. most containers);
- for docker, we collect and discard stats for containers which share a netns with another container (which is rare, I guess).

Instead of reading and parsing a bunch of files in /proc/PID/net and then discarding the just-gathered stats, let's set things up in a way so we don't collect useless stats in the first place. This should improve performance, memory usage, and ease the load on garbage collection.

Signed-off-by: Kir Kolyshkin <[email protected]>
Force-pushed from 177b129 to eac1257
Whoops, fixed the logic (that was inverted for containerd and crio). 😊

diff --git a/container/containerd/handler.go b/container/containerd/handler.go
index 626d4fd3..57a2d82c 100644
--- a/container/containerd/handler.go
+++ b/container/containerd/handler.go
@@ -131,7 +131,7 @@ func newContainerdContainerHandler(
 	// infrastructure container -- does not need their stats to be
 	// reported. This stops metrics being reported multiple times for each
 	// container in a pod.
-	metrics := common.RemoveNetMetrics(includedMetrics, cntr.Labels["io.cri-containerd.kind"] == "sandbox")
+	metrics := common.RemoveNetMetrics(includedMetrics, cntr.Labels["io.cri-containerd.kind"] != "sandbox")
 	libcontainerHandler := containerlibcontainer.NewHandler(cgroupManager, rootfs, int(taskPid), metrics)
diff --git a/container/crio/handler.go b/container/crio/handler.go
index e945b790..a6832518 100644
--- a/container/crio/handler.go
+++ b/container/crio/handler.go
@@ -154,7 +154,7 @@ func newCrioContainerHandler(
 	// infrastructure container -- does not need their stats to be
 	// reported. This stops metrics being reported multiple times for each
 	// container in a pod.
-	metrics := common.RemoveNetMetrics(includedMetrics, cInfo.Labels["io.kubernetes.container.name"] == "POD")
+	metrics := common.RemoveNetMetrics(includedMetrics, cInfo.Labels["io.kubernetes.container.name"] != "POD")
 	libcontainerHandler := containerlibcontainer.NewHandler(cgroupManager, rootFs, cInfo.Pid, metrics)
This is just to run e2e tests with cAdvisor from google/cadvisor#3302.

Signed-off-by: Kir Kolyshkin <[email protected]>
OTOH, this gave us a way to check whether the test mentioned in #3302 (comment) is working. Indeed it is! From https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/117833/pull-kubernetes-node-e2e-containerd/1655703979042017280/:
With the change from #3302 (comment) the e2e tests are now passing in kubernetes/kubernetes#117833 (this time I ran crio-e2e as well).
all updates LGTM! thank you for testing in k/k and glad to see it was helpful and caught a potential issue! :)
@iwankgb please take a final look and merge if it's all ok from your side. LGTM from me.
I actually found the bug while re-reviewing this PR first, and then took a look at the test results. In retrospect, it's good that we validated that the e2e test works.
Here's a follow-up to this we can make for cri-o: #3592
Apparently, we collect network stats for all containers, and then discard some or most of them:

- for docker, we collect and discard stats for containers which share a netns with another container (which is rare, I guess; see the sketch after this description);
- for both crio and containerd, we collect and discard stats for containers that are not infra (sandbox, pod, pause) containers (which is very common).
Instead of reading and parsing a bunch of files in /proc/PID/net and then throwing those stats away, let's set things up in a way so we don't collect useless stats in the first place.
This should improve performance, memory usage, and ease the load on garbage collection.
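For the Docker case above, the decision essentially comes down to whether the container joins another container's network namespace. A hypothetical sketch of that check (removeNet is an illustrative name, not the handler's actual code; the real wiring lives in container/docker/handler.go):

```go
package main

import (
	"fmt"

	dockercontainer "github.com/docker/docker/api/types/container"
)

// removeNet (illustrative name) reports whether network metrics should be
// dropped for a Docker container: when it joins another container's netns
// ("container:<id>"), the traffic is already accounted to that other
// container, so collecting it again here would only duplicate the stats.
func removeNet(mode dockercontainer.NetworkMode) bool {
	return mode.IsContainer()
}

func main() {
	fmt.Println(removeNet("bridge"))           // false: keep network metrics
	fmt.Println(removeNet("container:abc123")) // true: drop network metrics
}
```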