
[BPF] Enhance conntrack map flexibility with CPU-based scaling #9581

Open

wants to merge 1 commit into base: master

Conversation

ioworker0

@ioworker0 ioworker0 commented Dec 10, 2024

Description

  • new feature

Previously, BPFMapSizeConntrack was set to a fixed value, which lacked
flexibility across different specs of machines. Now, with the introduction
of BPFMapSizePerCPUConntrack, the conntrack map size can be adjusted
dynamically as BPFMapSizePerCPUConntrack * (number of CPUs). This allows
for larger map sizes on high-spec machines and smaller map sizes on
lower-spec machines, optimizing resource usage accordingly.
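
For illustration, assuming the new setting is exposed on FelixConfiguration as bpfMapSizePerCPUConntrack (the exact field name and syntax follow this PR, so treat this as a sketch rather than final API), enabling per-CPU sizing could look like:

```yaml
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  # Hypothetical field added by this PR: the effective conntrack map size
  # becomes bpfMapSizePerCPUConntrack * (number of CPUs),
  # e.g. 10000 * 64 = 640000 entries on a 64-core node.
  bpfMapSizePerCPUConntrack: 10000
```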

BTW, recently, we added several high-spec machines to the data center,
which led to the conntrack map filling up on these machines. This
could have been avoided with BPFMapSizePerCPUConntrack, IMHO.

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Enhance conntrack map flexibility with CPU-based scaling.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@ioworker0 ioworker0 requested a review from a team as a code owner December 10, 2024 03:36
@marvin-tigera marvin-tigera added this to the Calico v3.30.0 milestone Dec 10, 2024
@marvin-tigera marvin-tigera added the release-note-required (Change has user-facing impact, no matter how small) and docs-pr-required (Change is not yet documented) labels Dec 10, 2024
@CLAassistant

CLAassistant commented Dec 10, 2024

CLA assistant check
All committers have signed the CLA.

@tomastigera tomastigera self-assigned this Dec 10, 2024
@tomastigera
Contributor

Might be related to https://calicousers.slack.com/archives/CPEPF833L/p1733131626596509

@ioworker0
Author

ioworker0 commented Dec 11, 2024

Might be related to https://calicousers.slack.com/archives/CPEPF833L/p1733131626596509

Thanks for providing the reference!

IMHO, there are two issues that need improvement: first, the conntrack map does not scale flexibly with CPU resources; second, we lack counters to monitor the conntrack map size, which means we cannot track its current, max, or limit values.

BTW, we're working on the second issue, which will make it easier to troubleshoot problems with the conntrack map in production.

Thanks,
Lance

@lwr20
Member

lwr20 commented Dec 11, 2024

re. the lack of counters, in the thread that Tomas linked, @fasaxc said:

I think we do have counters on master: https://github.com/projectcalico/calico/blob/master/felix/bpf/conntrack/bpf_scanner.go#L43
I added them when implementing the new BPF-based conntrack cleaner

@lwr20
Member

lwr20 commented Dec 11, 2024

I have some nervousness about this approach to auto-scaling - it seems that the intent here is to scale the map size based on the number of CPUs. But map size is about memory, isn't it? And free memory on a node depends entirely on how tightly you pack workload pods onto the node. So I'm not entirely sure about the basic premise of this PR. I'm worried that we create an API that we have to support forever but that isn't a good experience for most users.

Previously, BPFMapSizeConntrack was set to a fixed value, which lacked
flexibility across different specs of machines.

The "lacking flexibility" part is not entirely true. The default felixconfig applies globally, but it can be overridden for particular nodes by making felixconfigs called node.<nodename>. See https://docs.tigera.io/calico/latest/reference/resources/felixconfig#felixconfig

Calico automatically creates a resource named default containing the global default configuration settings for Felix. You can use calicoctl to view and edit these settings.
The resources with the name node.<nodename> contain the node-specific overrides, and will be applied to that node. When deleting a node, the FelixConfiguration resource associated with the node will also be deleted.

Now clearly if you have a lot of large nodes, that's going to be a pain to configure. ISTR @caseydavenport proposed a system where we add matchlabels to Felixconfigs, and you would then label the nodes so that the correct felixconfig applies to them, and you would only need a felixconfig per node-type in your cluster. That might solve this issue in a way which works for more users?
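
For reference, a per-node override of the kind described above might look something like the following sketch (the node name and size value are illustrative; bpfMapSizeConntrack is the existing fixed-size setting):

```yaml
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  # Applies only to the node named "big-node-01" (hypothetical name),
  # overriding the global "default" FelixConfiguration for that node.
  name: node.big-node-01
spec:
  # Larger fixed conntrack map size for this high-spec node.
  bpfMapSizeConntrack: 1048576
```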

@ioworker0 ioworker0 closed this Dec 11, 2024
@marvin-tigera marvin-tigera removed the release-note-required (Change has user-facing impact, no matter how small) and docs-pr-required (Change is not yet documented) labels Dec 11, 2024
@ioworker0
Author

re. the lack of counters, in the thread that Tomas linked, @fasaxc said:

I think we do have counters on master: https://github.com/projectcalico/calico/blob/master/felix/bpf/conntrack/bpf_scanner.go#L43
I added them when implementing the new BPF-based conntrack cleaner

Ah... I didn't notice them before, and they are really useful as well, thanks!

However, when it comes to troubleshooting issues with the conntrack map, we still need counters for conntrack creation failure events, IMHO ;)

@ioworker0 ioworker0 reopened this Dec 11, 2024
@marvin-tigera marvin-tigera added the release-note-required (Change has user-facing impact, no matter how small) and docs-pr-required (Change is not yet documented) labels Dec 11, 2024
@ioworker0
Author

I have some nervousness about this approach to auto-scaling - it seems that the intent here is to scale the map size based on the number of CPUs. But map size is about memory, isn't it? And free memory on a node depends entirely on how tightly you pack workload pods onto the node. So I'm not entirely sure about the basic premise of this PR. I'm worried that we create an API that we have to support forever but that isn't a good experience for most users.

Previously, BPFMapSizeConntrack was set to a fixed value, which lacked
flexibility across different specs of machines.

The "lacking flexibility" part is not entirely true. The default felixconfig applies globally, but it can be overridden for particular nodes by making felixconfigs called node.<nodename>. See https://docs.tigera.io/calico/latest/reference/resources/felixconfig#felixconfig

Calico automatically creates a resource named default containing the global default configuration settings for Felix. You can use calicoctl to view and edit these settings.
The resources with the name node.<nodename> contain the node-specific overrides, and will be applied to that node. When deleting a node, the FelixConfiguration resource associated with the node will also be deleted.

Now clearly if you have a lot of large nodes, that's going to be a pain to configure. ISTR @caseydavenport proposed a system where we add matchlabels to Felixconfigs, and you would then label the nodes so that the correct felixconfig applies to them, and you would only need a felixconfig per node-type in your cluster. That might solve this issue in a way which works for more users?

Thanks a lot for the points!

For a CNI like Calico, I believe that auto-scaling is appropriate. It might also be better to adjust the map size based on available memory, as you suggested, similar to how network-stack parameters such as TCP buffer sizes are dynamically adjusted according to system memory in the Linux kernel, IIRC.

Moreover, maintaining Felix configurations for each node type could be a little challenging; nobody dislikes having less config, I guess. Kubernetes aims to free us from managing individual machines, allowing us to focus less on the specific types of machines within a cluster, IMHO.

What do you think?

@ioworker0
Author

@tomastigera

I'm just wondering if it would be possible for you to review this?

@ioworker0
Author

@tomastigera

@tomastigera PING ~

Member

@fasaxc fasaxc left a comment


I think this is a reasonable incremental improvement on what we have that doesn't do any harm if you don't actively use it.

Ideally, we'd auto-size the conntrack map based on load. If we see it get full, double its size, but that's a lot more work.

It'd also be nice to have a config model that allows for configuring groups of similar nodes by selector.

Review comments were left on felix/dataplane/linux/int_dataplane.go and api/pkg/apis/projectcalico/v3/felixconfig.go (outdated, resolved).
@ioworker0
Author

Thanks a lot for taking the time to review!

I think this is a reasonable incremental improvement on what we have that doesn't do any harm if you don't actively use it.

Yeah, this is a simple/harmless change that effectively solves the issue we're facing ;)

Ideally, we'd auto-size the conntrack map based on load. If we see it get full, double its size, but that's a lot more work.

Totally agreed. It's actually more like future-proofing work, so let's do it separately ;p

It'd also be nice to have a config model that allows for configuring groups of similar nodes by selector.

However, IMHO, it might be a bit challenging, as we currently lack a clear way to know how many groups of nodes there are, or whether they are faster or slower ;(

@ioworker0
Author

@fasaxc everything is ready, could you please take another look?

@fasaxc
Member

fasaxc commented Jan 3, 2025

/sem-approve

Previously, BPFMapSizeConntrack was set to a fixed value, which lacked
flexibility across different specs of machines. Now, with the introduction
of BPFMapSizePerCPUConntrack, the conntrack map size can be adjusted
dynamically as BPFMapSizePerCPUConntrack * (number of CPUs). This allows
for larger map sizes on high-spec machines and smaller map sizes on
lower-spec machines, optimizing resource usage accordingly.

BTW, recently, we added several high-spec machines to the data center,
which led to the conntrack map filling up on these machines. This
could have been avoided with BPFMapSizePerCPUConntrack, IMHO.

Suggested-by: Tomas Hruby <[email protected]>
Signed-off-by: Mingzhe Yang <[email protected]>
Co-Authored-By: Amaindex <[email protected]>
Co-Authored-By: Lance Yang <[email protected]>
@ioworker0
Author

/sem-approve

@ioworker0 ioworker0 requested a review from fasaxc January 4, 2025 03:54
@ioworker0
Author

/sem-approve

Okay, I cannot launch the workflow :(
