performance: optimize memory usage #94

Draft · wants to merge 1 commit into base: main

Conversation

howardjohn

In a ~15k Pod cluster, I see this drop memory footprint from about 200MB to 66MB.

This has a few improvements:

  • Don't just strip managed fields, but keep only exactly what we want.
  • Intern strings to reduce duplication of common things (label keys are almost always duplicated, etc.); a minimal sketch of such an interner follows after this list.
  • During startup only, GC more aggressively. The problem with the above approach is we get the full object then drop it. So if we don't manually GC, we will bloat up and not recover for a long time.
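
The intern/internm helpers aren't shown in this excerpt; purely as an illustration of the idea (names assumed from how the diff uses them, not the PR's actual code), a minimal map-based interner could look like this:

// Sketch of a string interner matching the intern/internm names used in the
// diff. A package-level map deduplicates common strings (label keys/values,
// namespaces, phases) so equal values share a single allocation. Not safe for
// concurrent use; a mutex or sync.Map would be needed if the transform can run
// from multiple goroutines.
var internPool = map[string]string{}

func intern(s string) string {
    if v, ok := internPool[s]; ok {
        return v
    }
    internPool[s] = s
    return s
}

func internm(m map[string]string) map[string]string {
    if m == nil {
        return nil
    }
    out := make(map[string]string, len(m))
    for k, v := range m {
        out[intern(k)] = intern(v)
    }
    return out
}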

Some of this is overkill I think; I will see what I can remove and what benefits remain, so we don't end up with complexity that doesn't buy us anything.

Before/after: see the green vs purple line, ignore the rest...

[screenshot: 2024-10-03_14-49-51]

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: howardjohn
Once this PR has been reviewed and has the lgtm label, please assign thockin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Oct 3, 2024
@howardjohn marked this pull request as draft on October 3, 2024 at 22:17
@k8s-ci-robot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) and the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) on Oct 3, 2024
for _, i := range po.Status.PodIPs {
    ips = append(ips, v1.PodIP{IP: intern(i.IP)})
}
return &v1.Pod{

Contributor

isn't replacing in place cheaper?


ah, apparently SetTransform is the exception to the normal "you can't modify objects from the informer cache" rule

Contributor

it runs before the object goes into the informer cache, IIRC
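
For context, a rough sketch of where a transform plugs into client-go's shared informers; the trimming inside the callback is illustrative, not the PR's exact code:

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

// newPodInformer is a hypothetical helper showing where SetTransform fits.
func newPodInformer(client kubernetes.Interface) cache.SharedIndexInformer {
    factory := informers.NewSharedInformerFactory(client, 0)
    podInformer := factory.Core().V1().Pods().Informer()

    // The transform runs on every object before it is stored in the informer
    // cache (and before event handlers see it), so returning a fresh, trimmed
    // copy here is safe. SetTransform must be called before the informer starts.
    _ = podInformer.SetTransform(func(obj interface{}) (interface{}, error) {
        po, ok := obj.(*v1.Pod)
        if !ok {
            return obj, nil // non-Pod items (e.g. tombstones) pass through untouched
        }
        return &v1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: po.Name, Namespace: po.Namespace, Labels: po.Labels},
            Status:     v1.PodStatus{PodIPs: po.Status.PodIPs},
        }, nil
    })
    return podInformer
}

Because the full object from the watch is discarded as soon as the transform returns, only the trimmed copy ever reaches the cache, which is why mutating or replacing it here doesn't break the usual "don't modify cached objects" rule.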

Author

The reason I didn't replace in place (and made sure to copy every item entirely) is due to how Go will sometimes not GC a full struct if you hold a reference to something within the struct. https://stackoverflow.com/a/55018900

However, I doubt this applies here, or if it does, it can be avoided in a simpler manner -- I was measuring things wrong, so I took an overkill approach. I'll clean it up.
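
A hedged illustration of that retention behavior (not code from this PR): a sub-slice shares the backing array of the original slice, so holding even a tiny sub-slice keeps the whole allocation reachable, while copying out the bytes you need does not.

package main

import "fmt"

// firstTenShared returns a view into big's backing array; as long as the
// caller holds the result, the entire 16 MiB stays reachable.
func firstTenShared(big []byte) []byte { return big[:10] }

// firstTenCopied copies the ten bytes into a new allocation, so the 16 MiB
// backing array can be collected once big itself goes out of scope.
func firstTenCopied(big []byte) []byte {
    out := make([]byte, 10)
    copy(out, big[:10])
    return out
}

func main() {
    big := make([]byte, 16<<20)
    shared := firstTenShared(big)
    copied := firstTenCopied(big)
    fmt.Println(len(shared), len(copied)) // both print 10, but they pin very different amounts of memory
}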

if isInitialList {
    pods++
    if pods%200 == 0 {
        runtime.GC()

Contributor

this scares me


it doesn't scare me, but it seems kind of ugly and maybe would fit better in the workqueue-processing code (e.g. processNextItem) rather than the informer itself?

Contributor

Maybe GOMEMLIMIT would be a more conventional approach?
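
For reference, the Go runtime already reads a soft memory limit from the GOMEMLIMIT environment variable; it can also be set programmatically. A minimal sketch (the 128 MiB figure is illustrative, not from this PR):

package main

import (
    "os"
    "runtime/debug"
)

func main() {
    // If GOMEMLIMIT isn't set in the environment, apply an illustrative
    // 128 MiB soft limit. As the heap approaches the limit, the runtime
    // collects more aggressively on its own, which is the conventional
    // alternative to calling runtime.GC() by hand during the initial list.
    if _, ok := os.LookupEnv("GOMEMLIMIT"); !ok {
        debug.SetMemoryLimit(128 << 20)
    }
    // ... start the controller as usual ...
}

The trade-off is that someone has to pick a sensible limit for the deployment, whereas the explicit GC() calls need no tuning but run unconditionally.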

Comment on lines +185 to +194
ObjectMeta: metav1.ObjectMeta{
    Name:      intern(po.Name),
    Namespace: intern(po.Namespace),
    Labels:    internm(po.Labels),
},
Spec: v1.PodSpec{Hostname: intern(po.Spec.Hostname), NodeName: intern(po.Spec.NodeName)},
Status: v1.PodStatus{
    Phase:  v1.PodPhase(intern(string(po.Status.Phase))),
    PodIPs: ips,
},

Contributor

what I would love is to do this server side, so you can trim the objects before sending them over the wire; there is a metadata-only informer, but we still need these values from Spec and Status


> what I would love is to do this server side, so you can trim the objects before sending them over the wire

long term, the most efficient thing would be to have a central controller that processes the Pods and distributes processed data about them to the per-node kube-network-policy processes (in the same way kcm processes Pods into EndpointSlices so kube-proxy only needs to watch the latter). Using an API object (possibly even EndpointSlice) would be one way to do that, but using a gRPC API (like xDS) would probably be better.

Contributor

yeah, I started down that path and have a branch with that, but I didn't like moving to a two-component model ... xDS is cool, but in this case this is a cache-synchronization problem, so the current informers will be enough for efficient data transmission and will avoid the additional dependencies

Author

One other thing you could do is to have, on the client side, a custom Pod type that is only the subset. While the apiserver will send the full pod, it will only end up as bytes and not be deserialized into a Pod object. I did something similar a while back with decent results (but it wasn't with Kubernetes; it's likely a pain to do with client-go).

Then you probably don't need the GC thing either.

Obviously server side is best though
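
As a rough illustration of that client-side subset idea (hypothetical types, not an existing client-go feature): decode the watch bytes into a struct that declares only the needed fields, and everything else is dropped at unmarshal time. The painful part is plumbing such a type through client-go's informer machinery.

package main

import (
    "encoding/json"
    "fmt"
)

// slimPod is a hypothetical client-side type that declares only the fields
// this controller cares about; unknown JSON fields are skipped while decoding.
type slimPod struct {
    Metadata struct {
        Name      string            `json:"name"`
        Namespace string            `json:"namespace"`
        Labels    map[string]string `json:"labels"`
    } `json:"metadata"`
    Status struct {
        PodIPs []struct {
            IP string `json:"ip"`
        } `json:"podIPs"`
    } `json:"status"`
}

func main() {
    raw := []byte(`{"metadata":{"name":"a","namespace":"default","labels":{"app":"x"}},
        "spec":{"containers":[{"name":"c","image":"nginx"}]},
        "status":{"podIPs":[{"ip":"10.0.0.5"}]}}`)
    var p slimPod
    if err := json.Unmarshal(raw, &p); err != nil {
        panic(err)
    }
    fmt.Println(p.Metadata.Name, p.Status.PodIPs[0].IP) // spec was never materialized
}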


nfqueue "github.com/florianl/go-nfqueue"
"github.com/mdlayher/netlink"

v1 "k8s.io/api/core/v1"


I prefer it with the blank line there.

(But we should adopt a consistent style. Currently:

  • pkg/networkpolicy/controller.go has a blank line between the github and k8s imports
  • pkg/networkpolicy/metrics.go puts them together
  • cmd/main.go is a mess)

Contributor

do your magic :)


if po, ok := obj.(*v1.Pod); ok {
    ips := make([]v1.PodIP, 0, len(po.Status.PodIPs))
    for _, i := range po.Status.PodIPs {
        ips = append(ips, v1.PodIP{IP: intern(i.IP)})


Pod IPs don't get reused very quickly... This probably doesn't help much...
(Is there a performance tradeoff between "memory saved by interning" vs "extra allocations used in the process of interning"?)

Author

Yeah, this is definitely likely to be dropped when I try to minimize the diff here. I did too much, mostly because I was misreading some pprofs before I added the explicit GC() step. I doubt this adds any value.

    Namespace: intern(po.Namespace),
    Labels:    internm(po.Labels),
},
Spec: v1.PodSpec{Hostname: intern(po.Spec.Hostname), NodeName: intern(po.Spec.NodeName)},


the code doesn't seem to use Hostname


@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Oct 23, 2024
Labels: cncf-cla: yes · do-not-merge/work-in-progress · needs-rebase · size/M