Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsexec: moving as much as we can to Go #3951

Open
2 of 9 tasks
cyphar opened this issue Aug 1, 2023 · 2 comments
Open
2 of 9 tasks

nsexec: moving as much as we can to Go #3951

cyphar opened this issue Aug 1, 2023 · 2 comments

Comments

@cyphar
Copy link
Member

cyphar commented Aug 1, 2023

Here is a list of things that nsexec does today, and a description of possible ideas and challenges to moving them to Go:

  • Basic namespace creation can be done with syscall.SysProcAttr today. There is code in nsexec.c that fixes very old kernel bugs (such as userns ownership being broken on old RHEL 6 kernels) which I'm sure Go stdlib doesn't care about (and maybe we shouldn't either). However there are some things that I suspect are not ideal:
    • For rootless containers, are newuidmap fallbacks supported by Go stdlib? I suspect not. Though obviously we could write PRs to add them to stdlib...
    • We currently do CLONE_NEWCGROUP far later during setup than the other namespaces, but this is actually a vestigial implementation detail of the original cgroupns code (which incorrectly added a synchronisation step to "ensure" runc init was in the correct cgroup). The logic probably should've been removed in 5110bd2.
  • We might want to use CLONE_INTO_CGROUP (SysProcAttr.CgroupFD, which should make all container nsenter-related operations faster because it avoids taking a bunch of locks in cgroup-core), but with nsexec: cloned_binary: remove bindfd logic entirely #3931 we need the memfd_create() and /proc/self/exe copy to be executed outside of the container cgroup. This is easy to fix by just doing the clone in Go code (we could even make it part of the runc init fork+exec -- to remove one extra exec).
  • Bind mount sources (Open bind mount sources from the host userns #2576) are currently handled in a way that requires C code because we need to temporarily join the container mount namespace to open the mount sources (the kernel blocks bind-mount sources from a different mount namespace). However, open_tree(OPEN_TREE_CLONE) allows us to get around this, meaning that if we bump the minimum kernel version for this optional runc feature, we can do OPEN_TREE_CLONE on the host runc side and just pass the detached-mountfds to the rootfs setup code without any C code.
    • IDMAP_SOURCES_ATTR is implemented in an analogous way to the bind-mount sources code, but because it requires mount_setattr(2), the OPEN_TREE_CLONE suggestion above is actually even more applicable. We can even implement Support for ID map mounts without userns #3943 entirely in Go and apply the mount attribute in the host side of runc.
  • The session handling by SysProcAttr is probably fine to just use "normally", but I wonder if we need to take anything in particular into account.
  • As far as I can see, there is no way to join existing namespaces with SysProcAttr -- and since setns(CLONE_NEWUSER) requires us to be single-threaded this would mean we would need to keep nsexec.c. This would require a patch to stdlib, but there is an additional issue to consider:
    • setns() supports joining a subset of namespaces of a given pidfd since Linux 5.8. This is something we probably want to use, but if we use stdlib there isn't a nice way to handle the fallback (join each namespace separately). It is trivial to detect whether it is supported (pass a pidfd to setns) but due to the API of SysProcAttr, I suspect we would need to detect the support from Go (we can do this safely with CLONE_NEWUSER because it will always fail but this will fail with -EINVAL so we can't use it to detect anything -- maybe we will need to do CLONE_NEWPID in a os-thread-locked goroutine and switch back because that doesn't affect the running process...) and then tune what we pass to SysProcAttr separately.

Agree with @cyphar -- if we can do it in Go, we should do it in Go.

Overall I very much hope we'll eventually be able to do all of it in Go. For example, with cgroupfd support in the kernel (since v5.7) and golang stdlib (since 1.20), we can enter cgroups way easier.

Originally posted by @kolyshkin in #3943 (comment)

@cyphar
Copy link
Member Author

cyphar commented Aug 1, 2023

@kolyshkin Never say never, and I would love to remove all the C code from our codebase, but I'm not sure if even on newer kernels and with the newest stdlib we will be able to do that (at the very least I don't think Go has handling for newuidmap -- though this could of course be added). I can come up with a list of things to do in a separate issue if you want to have a chat about the problem. For one thing, I think that (for performance and security reasons) we almost certainly want to implement the runc userns creation for mount_setattr(2) in CGO as a (slightly unsafe) fork. CLONE_INTO_CGROUP is something we might want but as I mentioned in #3931, cgroupv2 doesn't migrate memory usage when moving cgroups, so if we use CLONE_INTO_CGROUP we will need to also move the ensure_cloned_binary() logic out of runc init -- though we can always implement in Go so this is probably not that big of a deal.

Originally posted by @cyphar in #3943 (comment)

@cyphar cyphar mentioned this issue Aug 1, 2023
4 tasks
@lifubang
Copy link
Member

I have a similar work in local, I think I have completed most of the work, once I have refactored all my code, I'll open a PR.
I have moved all stage-0 and stage-2 to go code, and only leave stage-1 in c. When I was working, I find some limitations in go:

  1. After we unshare/setns a new pid ns in cgo, the program will be crashed when entering the go routine from cgo;
  2. After we using setns to join the init process's mount ns, the program will be crashed when entering the go routine from cgo.

For the second problem, I have opened a issue in go to discuss: (golang/go#67653).
The the first problem, I can't find the core reason, but I have moved the pid ns setup in go routine, but it seems ugly.
Welcome more suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants