
[BPF] Enhance conntrack map flexibility with CPU-based scaling #9581

Open

wants to merge 1 commit into base: master

Conversation

ioworker0

@ioworker0 ioworker0 commented Dec 10, 2024

Description

  • new feature

Previously, BPFMapSizeConntrack was set to a fixed value, which lacked
flexibility across different specs of machines. Now, with the introduction
of BPFMapSizePerCPUConntrack, the conntrack map size can be adjusted
dynamically as BPFMapSizePerCPUConntrack * (number of CPUs). This allows
for larger map sizes on high-spec machines and smaller map sizes on
lower-spec machines, optimizing resource usage accordingly.
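
For illustration, assuming the new setting is exposed on FelixConfiguration as bpfMapSizePerCPUConntrack (the exact field name and syntax follow this PR, so treat this as a sketch rather than final API), enabling per-CPU sizing could look like:

```yaml
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  # Hypothetical field added by this PR: the effective conntrack map size
  # becomes bpfMapSizePerCPUConntrack * (number of CPUs),
  # e.g. 10000 * 64 = 640000 entries on a 64-core node.
  bpfMapSizePerCPUConntrack: 10000
```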

BTW, recently, we added several high-spec machines to the data center,
which led to the conntrack map filling up on these machines. This
could have been avoided with BPFMapSizePerCPUConntrack, IMHO.

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Enhance conntrack map flexibility with CPU-based scaling.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@ioworker0 ioworker0 requested a review from a team as a code owner December 10, 2024 03:36
@marvin-tigera marvin-tigera added this to the Calico v3.30.0 milestone Dec 10, 2024
@marvin-tigera marvin-tigera added the release-note-required (Change has user-facing impact, no matter how small) and docs-pr-required (Change is not yet documented) labels Dec 10, 2024
@CLAassistant

CLAassistant commented Dec 10, 2024

CLA assistant check
All committers have signed the CLA.

@tomastigera tomastigera self-assigned this Dec 10, 2024
@tomastigera
Contributor

Might be related to https://calicousers.slack.com/archives/CPEPF833L/p1733131626596509

@ioworker0
Author

ioworker0 commented Dec 11, 2024

Might be related to https://calicousers.slack.com/archives/CPEPF833L/p1733131626596509

Thanks for providing the reference!

IMHO, there are two issues that need improvement: first, the conntrack map does not scale flexibly with CPU resources; second, we lack counters to monitor the conntrack map size, which means we cannot track its current, max, or limit values.

BTW, we're working on the second issue, which will make it easier to troubleshoot problems with the conntrack map in production.

Thanks,
Lance

@lwr20
Member

lwr20 commented Dec 11, 2024

re. the lack of counters, in the thread that Tomas linked, @fasaxc said:

I think we do have counters on master: https://github.com/projectcalico/calico/blob/master/felix/bpf/conntrack/bpf_scanner.go#L43
I added them when implementing the new BPF-based conntrack cleaner

@lwr20
Member

lwr20 commented Dec 11, 2024

I have some nervousness about this approach to auto-scaling - it seems that the intent here is to scale the map size based on the number of CPUs. But map size is about memory, isn't it? And free memory on a node depends entirely on how tightly you pack workload pods onto the node. So I'm not entirely sure about the basic premise of this PR. I'm worried that we create an API that we have to support forever but that isn't a good experience for most users.

Previously, BPFMapSizeConntrack was set to a fixed value, which lacked
flexibility across different specs of machines.

The "lacking flexibility" part is not entirely true. The default felixconfig applies globally, but it can be overridden for particular nodes by making felixconfigs called node.<nodename>. See https://docs.tigera.io/calico/latest/reference/resources/felixconfig#felixconfig

Calico automatically creates a resource named default containing the global default configuration settings for Felix. You can use calicoctl to view and edit these settings.
The resources with the name node.<nodename> contain the node-specific overrides, and will be applied to that node. When deleting a node, the FelixConfiguration resource associated with the node will also be deleted.

Now clearly if you have a lot of large nodes, that's going to be a pain to configure. ISTR @caseydavenport proposed a system where we add matchlabels to Felixconfigs, and you would then label the nodes so that the correct felixconfig applies to them, and you would only need a felixconfig per node-type in your cluster. That might solve this issue in a way which works for more users?
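
For reference, a per-node override of the kind described above might look something like the following sketch (the node name and size value are illustrative; bpfMapSizeConntrack is the existing fixed-size setting):

```yaml
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  # Applies only to the node named "big-node-01" (hypothetical name),
  # overriding the global "default" FelixConfiguration for that node.
  name: node.big-node-01
spec:
  # Larger fixed conntrack map size for this high-spec node.
  bpfMapSizeConntrack: 1048576
```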

@ioworker0 ioworker0 closed this Dec 11, 2024
@marvin-tigera marvin-tigera removed the release-note-required (Change has user-facing impact, no matter how small) and docs-pr-required (Change is not yet documented) labels Dec 11, 2024
@ioworker0
Author

re. the lack of counters, in the thread that Tomas linked, @fasaxc said:

I think we do have counters on master: https://github.com/projectcalico/calico/blob/master/felix/bpf/conntrack/bpf_scanner.go#L43
I added them when implementing the new BPF-based conntrack cleaner

Ah... I didn't notice them before, and they are really useful as well, thanks!

However, when it comes to troubleshooting issues with the conntrack map, we still need counters for conntrack creation failure events, IMHO ;)

@ioworker0 ioworker0 reopened this Dec 11, 2024
@marvin-tigera marvin-tigera added the release-note-required (Change has user-facing impact, no matter how small) and docs-pr-required (Change is not yet documented) labels Dec 11, 2024
@ioworker0
Author

I have some nervousness about this approach to auto-scaling - it seems that the intent here is to scale the map size based on the number of CPUs. But map size is about memory, isn't it? And free memory on a node depends entirely on how tightly you pack workload pods onto the node. So I'm not entirely sure about the basic premise of this PR. I'm worried that we create an API that we have to support forever but that isn't a good experience for most users.

Previously, BPFMapSizeConntrack was set to a fixed value, which lacked
flexibility across different specs of machines.

The "lacking flexibility" part is not entirely true. The default felixconfig applies globally, but it can be overridden for particular nodes by making felixconfigs called node.<nodename>. See https://docs.tigera.io/calico/latest/reference/resources/felixconfig#felixconfig

Calico automatically creates a resource named default containing the global default configuration settings for Felix. You can use calicoctl to view and edit these settings.
The resources with the name node.<nodename> contain the node-specific overrides, and will be applied to that node. When deleting a node, the FelixConfiguration resource associated with the node will also be deleted.

Now clearly if you have a lot of large nodes, that's going to be a pain to configure. ISTR @caseydavenport proposed a system where we add matchlabels to Felixconfigs, and you would then label the nodes so that the correct felixconfig applies to them, and you would only need a felixconfig per node-type in your cluster. That might solve this issue in a way which works for more users?

Thanks a lot for the points!

For a CNI like Calico, I believe that auto-scaling is appropriate. It might also be better to adjust the map size based on available memory, as you suggested, similar to how network-stack parameters such as TCP buffer sizes are dynamically adjusted according to system memory in the Linux kernel, IIRC.

Moreover, maintaining Felix configurations for each node type could be a little challenging; nobody dislikes having less config, I guess. Kubernetes aims to free us from managing individual machines, allowing us to focus less on the specific types of machines within a cluster, IMHO.

What do you think?

@ioworker0
Author

@tomastigera

I'm just wondering if it would be possible for you to review this?

@ioworker0
Author

@tomastigera

@tomastigera PING ~

Member

@fasaxc fasaxc left a comment


I think this is a reasonable incremental improvement on what we have that doesn't do any harm if you don't actively use it.

Ideally, we'd auto-size the conntrack map based on load. If we see it get full, double its size, but that's a lot more work.

It'd also be nice to have a config model that allows for configuring groups of similar nodes by selector.

Review comments were left on felix/dataplane/linux/int_dataplane.go and api/pkg/apis/projectcalico/v3/felixconfig.go (outdated, resolved).
@ioworker0
Author

Thanks a lot for taking the time to review!

I think this is a reasonable incremental improvement on what we have that doesn't do any harm if you don't actively use it.

Yeah, this is a simple/harmless change that effectively solves the issue we're facing ;)

Ideally, we'd auto-size the conntrack map based on load. If we see it get full, double its size, but that's a lot more work.

Totally agreed. It's actually more like future-proofing work, so let's do it separately ;p

It'd also be nice to have a config model that allows for configuring groups of similar nodes by selector.

However, IMHO, it might be a bit challenging, as we currently lack a clear way to know how many groups of nodes there are, or whether they are faster or slower ;(

@ioworker0
Author

@fasaxc everything is ready, could you please take another look?

@fasaxc
Member

fasaxc commented Jan 3, 2025

/sem-approve

Previously, BPFMapSizeConntrack was set to a fixed value, which lacked
flexibility across different specs of machines. Now, with the introduction
of BPFMapSizePerCPUConntrack, the conntrack map size can be adjusted
dynamically as BPFMapSizePerCPUConntrack * (number of CPUs). This allows
for larger map sizes on high-spec machines and smaller map sizes on
lower-spec machines, optimizing resource usage accordingly.

BTW, recently, we added several high-spec machines to the data center,
which led to the conntrack map filling up on these machines. This
could have been avoided with BPFMapSizePerCPUConntrack, IMHO.

Suggested-by: Tomas Hruby <[email protected]>
Signed-off-by: Mingzhe Yang <[email protected]>
Co-Authored-By: Amaindex <[email protected]>
Co-Authored-By: Lance Yang <[email protected]>
@ioworker0
Author

/sem-approve

@ioworker0 ioworker0 requested a review from fasaxc January 4, 2025 03:54
@ioworker0
Author

/sem-approve

Okay, I cannot launch the workflow :(
