
feat(env): Add automatic memory limit handling and move automaxprocs to pkg/util package #3806

Open · wants to merge 7 commits into base: master

Conversation

@dongjiang1989 (Contributor) commented Nov 8, 2024

Is your feature request related to a problem?

It is good practice to set the Go runtime environment variables GOMAXPROCS and GOMEMLIMIT, which tell Go the max threads/memory it should use. We should set these in our containers that run Go (namely gloo and discovery) whenever resource limits are set.
Automatically set GOMEMLIMIT to match the Linux cgroups(7) memory limit.

See:
https://pkg.go.dev/runtime
https://blog.howardjohn.info/posts/gomaxprocs/
https://github.com/KimMachineGun/automemlimit/

Describe the solution you'd like

  1. add automatic memory limit handling
  2. move automaxprocs to pkg/util package
```
$ ./controller-manager --kubeconfig ~/.kube/config --auto-gomemlimit-ratio=0.8
I1108 11:42:04.897134   51417 cpu.go:28] Leaving GOMAXPROCS=8: CPU quota undefined
I1108 11:42:04.897584   51417 memory.go:52] GOMEMLIMIT set to 6871947673
I1108 11:42:04.901564   51417 leaderelection.go:254] attempting to acquire leader lease volcano-system/vc-controller-manager...
```
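The GOMEMLIMIT value in the log is consistent with an 8 GiB container limit scaled by the 0.8 ratio. A minimal sketch of that arithmetic (the helper name here is made up for illustration and is not the PR's actual function):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// memLimitFromRatio scales a detected cgroup memory limit by the
// configured ratio, leaving headroom for non-heap memory such as
// goroutine stacks and cgo allocations.
func memLimitFromRatio(cgroupLimitBytes int64, ratio float64) int64 {
	return int64(float64(cgroupLimitBytes) * ratio)
}

func main() {
	limit := memLimitFromRatio(8<<30, 0.8) // 8 GiB * 0.8 = 6871947673 bytes
	fmt.Println("GOMEMLIMIT:", limit)
	debug.SetMemoryLimit(limit)
}
```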

Result

My conclusion from running this on a Kubernetes cluster is that setting GOMEMLIMIT equal to the container memory limit has the following advantages:

  • Significantly lower memory usage (at the cost of slightly higher CPU usage)
  • Fewer OOMKILL container failures
  • Improved ability to recover from OOMKILL failures and continue reconciling operator, scheduler and controller
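The memory-for-CPU trade-off above can be made concrete: with GOGC-based collection disabled, a tight GOMEMLIMIT forces the GC to run repeatedly to stay under the limit, while a huge limit lets garbage accumulate with no collections at all. A rough illustration (not the PR's code; the function and constants are made up for the demo):

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

var sink []byte // package-level so allocations are not optimized away

// gcCountUnderLimit disables GOGC-based collection, applies the given
// GOMEMLIMIT, churns through ~256 MiB of short-lived allocations, and
// reports how many GC cycles ran in the process.
func gcCountUnderLimit(limit int64) uint32 {
	debug.SetGCPercent(-1) // GC now triggers only via the memory limit
	debug.SetMemoryLimit(limit)
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < 256; i++ {
		sink = make([]byte, 1<<20) // 1 MiB; the previous buffer becomes garbage
	}
	runtime.ReadMemStats(&after)
	return after.NumGC - before.NumGC
}

func main() {
	// Tight limit: the GC must run often (more CPU, less memory held).
	fmt.Println("GCs with 64 MiB limit:", gcCountUnderLimit(64<<20))
}
```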

@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign shinytang6
You can assign the PR to them by writing /assign @shinytang6 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 8, 2024
Signed-off-by: dongjiang1989 <[email protected]>
@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 8, 2024
Signed-off-by: dongjiang1989 <[email protected]>
Signed-off-by: dongjiang1989 <[email protected]>
@dongjiang1989 (Contributor, Author)

Maybe Java GC; please re-run the failed action, thanks.

@hwdef (Member) commented Nov 8, 2024

You can close and reopen this PR to re-run the CI.

@Monokaix (Member)

Can you provide more information about why these are needed?

@dongjiang1989 (Contributor, Author)

> Can you provide more information about why these are needed?

Thanks. Information added.

@Monokaix (Member)

Welcome! Just a little doubt here:

  1. What is the behavior (how much memory the container can use) when both the container memory limit and GOMEMLIMIT are set? Is there any conflict?
  2. Does GOMEMLIMIT depend on the cgroup version?
  3. Under what circumstances does a user need auto GOMEMLIMIT?

@dongjiang1989 (Contributor, Author) commented Nov 15, 2024

> Welcome! Just a little doubt here:
>
>   1. What is the behavior (how much memory the container can use) when both the container memory limit and GOMEMLIMIT are set? Is there any conflict?
>   2. Does GOMEMLIMIT depend on the cgroup version?
>   3. Under what circumstances does a user need auto GOMEMLIMIT?

Thanks @Monokaix

  1. Both can be set; there is no direct conflict.
    The container memory limit and GOMEMLIMIT operate at different levels. The container memory limit is a coarse-grained hard cap imposed by the container runtime or the host operating system: it simply caps the total memory usage of the container's processes.
    GOMEMLIMIT, by contrast, is a finer-grained soft limit that the Go runtime uses to manage its internal memory usage, specifically the heap managed by the garbage collector.
    However, unreasonable settings of the container memory limit and GOMEMLIMIT can still cause OOM kills, which is why setting GOMEMLIMIT automatically is required.

  2. Both cgroup v1 and v2 are supported.
    Ref: https://github.com/KimMachineGun/automemlimit

  3. In the final state, users do not need to care: in container deployments, GOMEMLIMIT is set automatically.
    In the current state, user-configurable feature flags are exposed.
    Ref: https://github.com/prometheus/prometheus/releases/tag/v3.0.0
    [screenshot of the Prometheus v3.0.0 release notes]
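A sketch of how such auto-detection can work at the cgroup level (the file paths are the standard kernel interfaces; the parsing helper is illustrative and not the PR's code — libraries like automemlimit handle v1/v2 differences and nested cgroups much more carefully):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryMax interprets the contents of a cgroup memory-limit file:
// cgroup v2 exposes /sys/fs/cgroup/memory.max, where the literal "max"
// means "no limit"; cgroup v1 exposes memory/memory.limit_in_bytes,
// where "no limit" appears as a very large number instead.
func parseMemoryMax(contents string) (int64, bool) {
	s := strings.TrimSpace(contents)
	if s == "max" {
		return 0, false // v2: unlimited
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	// The presence of memory.max at the cgroup root is one simple way
	// to tell that the host is running cgroup v2.
	if data, err := os.ReadFile("/sys/fs/cgroup/memory.max"); err == nil {
		if limit, ok := parseMemoryMax(string(data)); ok {
			fmt.Println("cgroup v2 memory limit:", limit)
		} else {
			fmt.Println("cgroup v2, no memory limit set")
		}
	}
	// Demonstrate the parser on canned inputs.
	fmt.Println(parseMemoryMax("8589934592\n")) // a configured 8 GiB limit
	fmt.Println(parseMemoryMax("max\n"))        // unlimited
}
```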

Signed-off-by: dongjiang1989 <[email protected]>
@dongjiang1989 dongjiang1989 changed the title feat(env): Add automatic memory limit handling and move automaxprocs to internal/goruntime package feat(env): Add automatic memory limit handling and move automaxprocs to pkg/util package Nov 15, 2024
@dongjiang1989 (Contributor, Author)

/ok-to-test

@volcano-sh-bot (Contributor)

@dongjiang1989: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dongjiang1989 (Contributor, Author)

/assign @shinytang6

Labels
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
5 participants