[BPF] Enhance conntrack map flexibility with CPU-based scaling #9581
Might be related to https://calicousers.slack.com/archives/CPEPF833L/p1733131626596509
Thanks for providing the reference! IMHO, there are two issues that need improvement: first, the conntrack map does not scale flexibly with CPU resources; second, we lack counters to monitor the conntrack map size, which means we cannot track its current, max, or limit values. BTW, we're working on the second issue, which will make it easier to troubleshoot problems with the conntrack map in production. Thanks,
Re. the lack of counters, in the thread that Tomas linked, @fasaxc said:
I have some nervousness about this approach to auto-scaling. It seems that the intent here is to scale the map size based on CPU count. But map size is about memory, isn't it? And free memory on a node depends entirely on how tightly you pack workload pods onto the node. So I'm not entirely sure about the basic premise of this PR. I'm worried that we create an API that we have to support forever, but that isn't a good experience for most users.
The "lacking flexibility" part is not entirely true. The
Now clearly if you have a lot of large nodes, that's going to be a pain to configure. ISTR @caseydavenport proposed a system where we add matchlabels to Felixconfigs, and you would then label the nodes so that the correct felixconfig applies to them, and you would only need a felixconfig per node-type in your cluster. That might solve this issue in a way which works for more users?
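To make the matchlabels idea above concrete, here is a minimal Go sketch of picking a per-node-type config by node labels. It is not Felix code: the `nodeTypeConfig` type, its fields, and `selectConfig` are hypothetical names chosen for illustration, and "first match wins" is an assumed policy.

```go
package main

// nodeTypeConfig is a hypothetical stand-in for a Felixconfig that
// carries a label selector; it is not a real Calico type.
type nodeTypeConfig struct {
	Name             string
	MatchLabels      map[string]string
	ConntrackMapSize int
}

// matchesLabels reports whether every key/value pair in matchLabels is
// present in nodeLabels. An empty selector matches every node.
func matchesLabels(nodeLabels, matchLabels map[string]string) bool {
	for key, want := range matchLabels {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// selectConfig returns the first config whose selector matches the node.
// "First match wins" is an assumption made for this sketch.
func selectConfig(configs []nodeTypeConfig, nodeLabels map[string]string) (nodeTypeConfig, bool) {
	for _, cfg := range configs {
		if matchesLabels(nodeLabels, cfg.MatchLabels) {
			return cfg, true
		}
	}
	return nodeTypeConfig{}, false
}
```

With this shape, a cluster would need one config per node type plus a catch-all entry with an empty selector placed last.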
Ah... I didn't notice them before, and they are really useful as well, thanks! However, when it comes to troubleshooting issues with the conntrack map, we still need counters for conntrack creation failure events, IMHO ;)
Thanks a lot for the points! For a CNI like Calico, I believe that auto-scaling is appropriate. It might also be better to adjust the map size based on available memory, as you suggested, similar to how network stack parameters such as TCP buffer sizes are dynamically adjusted according to system memory in the Linux kernel, IIRC. Moreover, maintaining Felix configurations for each node-type could be a little challenging; nobody dislikes having less config, I guess. Kubernetes aims to free us from managing individual machines, allowing us to focus less on the specific types of machines within a cluster, IMHO. What do you think?
I'm just wondering whether you could take a look and review this?
@tomastigera PING ~
I think this is a reasonable incremental improvement on what we have that doesn't do any harm if you don't actively use it.
Ideally, we'd auto-size the conntrack map based on load: if we see it get full, double its size. But that's a lot more work.
It'd also be nice to have a config model that allows for configuring groups of similar nodes by selector.
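The load-based idea above (double the map when it fills up) could be sketched roughly as follows in Go. This is only an illustration of the sizing decision, not how Felix works: `nextMapSize`, the 90% "full" threshold, and the `maxSize` cap are all assumptions, and the hard part (recreating and repopulating a live BPF map) is not shown.

```go
package main

// nextMapSize returns the size the conntrack map should have next.
// It doubles the current size once occupancy reaches 90% (an arbitrary
// threshold chosen for this sketch) and never grows beyond maxSize.
func nextMapSize(current, used, maxSize int) int {
	if current <= 0 || current >= maxSize {
		// Nothing sensible to grow, or already at the cap.
		return current
	}
	// Integer form of used/current < 0.9 to avoid floating point.
	if used*10 < current*9 {
		return current
	}
	next := current * 2
	if next > maxSize {
		next = maxSize
	}
	return next
}
```

A cap is included because the map consumes kernel memory, which is the concern raised earlier in the thread.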
Thanks a lot for taking time to review!
Yeah, this is a simple/harmless change that effectively solves the issue we're facing ;)
Totally agreed. It's actually more like future-proofing work, so let's do it separately ;p
However, IMHO, it might be a bit challenging, as we currently lack a clear way to know how many groups of nodes there are, or whether they are faster or slower ;(
@fasaxc everything is ready, could you please review it again?
/sem-approve
Previously, BPFMapSizeConntrack was set to a fixed value, which lacked flexibility across machines with different specs. Now, with the introduction of BPFMapSizePerCPUConntrack, the conntrack map size can be adjusted dynamically as BPFMapSizePerCPUConntrack * (Number of CPUs). This allows for larger map sizes on high-spec machines and smaller map sizes on lower-spec machines, optimizing resource usage accordingly.
BTW, recently, we added several high-spec machines to the data center, which led to the conntrack map filling up on these machines. It could have been avoided with BPFMapSizePerCPUConntrack IMHO.
Suggested-by: Tomas Hruby <[email protected]>
Signed-off-by: Mingzhe Yang <[email protected]>
Co-Authored-By: Amaindex <[email protected]>
Co-Authored-By: Lance Yang <[email protected]>
/sem-approve
Okay, I cannot launch the workflow :(
Description
Previously, BPFMapSizeConntrack was set to a fixed value, which lacked
flexibility across machines with different specs. Now, with the introduction
of BPFMapSizePerCPUConntrack, the conntrack map size can be adjusted
dynamically as BPFMapSizePerCPUConntrack * (Number of CPUs). This allows
for larger map sizes on high-spec machines and smaller map sizes on
lower-spec machines, optimizing resource usage accordingly.
BTW, recently, we added several high-spec machines to the data center,
which led to the conntrack map filling up on these machines. It
could have been avoided with BPFMapSizePerCPUConntrack IMHO.
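As a rough illustration of the sizing rule described above, the following Go sketch computes the map size as the per-CPU value multiplied by the CPU count. The function names are hypothetical, and the fallback to the fixed size when the per-CPU value is unset (zero or negative) is an assumption made for this sketch rather than something stated in this description.

```go
package main

import "runtime"

// conntrackMapSize applies the rule perCPU * numCPU.
// Assumption for this sketch: a non-positive perCPU means per-CPU
// scaling is disabled, so the fixed size is used instead.
func conntrackMapSize(fixedSize, perCPU, numCPU int) int {
	if perCPU <= 0 || numCPU <= 0 {
		return fixedSize
	}
	return perCPU * numCPU
}

// conntrackMapSizeForHost uses the number of logical CPUs that the Go
// runtime reports for the current process.
func conntrackMapSizeForHost(fixedSize, perCPU int) int {
	return conntrackMapSize(fixedSize, perCPU, runtime.NumCPU())
}
```

For example, with a per-CPU value of 100000, a 64-CPU node would get a map of 6400000 entries while an 8-CPU node would get 800000 (the numbers are illustrative only).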
Related issues/PRs
Todos
Release Note
Reminder for the reviewer
Make sure that this PR has the correct labels and milestone set.
Every PR needs one docs-* label.
- docs-pr-required: This change requires a change to the documentation that has not been completed yet.
- docs-completed: This change has all necessary documentation completed.
- docs-not-required: This change has no user-facing impact and requires no docs.
Every PR needs one release-note-* label.
- release-note-required: This PR has user-facing changes. Most PRs should have this label.
- release-note-not-required: This PR has no user-facing changes.
Other optional labels:
- cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
- needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.