nsexec: moving as much as we can to Go #3951
Originally posted by @cyphar in #3943 (comment)
I have similar work locally and I think most of it is complete; once I have refactored my code, I'll open a PR.
For the second problem, I have opened an issue in Go to discuss it: golang/go#67653.
Here is a list of things that nsexec does today, and a description of possible ideas and challenges to moving them to Go:
- `syscall.SysProcAttr` today. There is code in `nsexec.c` that fixes very old kernel bugs (such as userns ownership being broken on old RHEL 6 kernels) which I'm sure Go stdlib doesn't care about (and maybe we shouldn't either). However there are some things that I suspect are not ideal:
  - `newuidmap` fallbacks supported by Go stdlib? I suspect not. Though obviously we could write PRs to add them to stdlib...
  - `CLONE_NEWCGROUP` far later during setup than the other namespaces, but this is actually a vestigial implementation detail of the original cgroupns code (which incorrectly added a synchronisation step to "ensure" `runc init` was in the correct cgroup). The logic probably should've been removed in 5110bd2.
  - `CLONE_INTO_CGROUP` (`SysProcAttr.CgroupFD`, which should make all container nsenter-related operations faster because it avoids taking a bunch of locks in cgroup-core), but with #3931 (nsexec: cloned_binary: remove bindfd logic entirely) we need the `memfd_create()` and `/proc/self/exe` copy to be executed outside of the container cgroup. This is easy to fix by just doing the clone in Go code (we could even make it part of the `runc init` fork+exec -- to remove one extra exec).
- `open_tree(OPEN_TREE_CLONE)` allows us to get around this, meaning that if we bump the minimum kernel version for this optional runc feature, we can do `OPEN_TREE_CLONE` on the host runc side and just pass the detached-mount fds to the rootfs setup code without any C code.
- `IDMAP_SOURCES_ATTR` is implemented in an analogous way to the bind-mount sources code, but because it requires `mount_setattr(2)`, the `OPEN_TREE_CLONE` suggestion above is actually even more applicable. We can even implement #3943 (Support for ID map mounts without userns) entirely in Go and apply the mount attribute in the host side of runc.
- `SysProcAttr` is probably fine to just use "normally", but I wonder if we need to take anything in particular into account.
- `SysProcAttr` -- and since `setns(CLONE_NEWUSER)` requires us to be single-threaded this would mean we would need to keep `nsexec.c`. This would require a patch to stdlib, but there is an additional issue to consider:
  - `setns()` supports joining a subset of namespaces of a given pidfd since Linux 5.8. This is something we probably want to use, but if we use stdlib there isn't a nice way to handle the fallback (join each namespace separately). It is trivial to detect whether it is supported (pass a pidfd to `setns`) but due to the API of `SysProcAttr`, I suspect we would need to detect the support from Go (we can do this safely with `CLONE_NEWUSER` because it will always fail but this will fail with `-EINVAL` so we can't use it to detect anything -- maybe we will need to do `CLONE_NEWPID` in an OS-thread-locked goroutine and switch back because that doesn't affect the running process...) and then tune what we pass to `SysProcAttr` separately.

Originally posted by @kolyshkin in #3943 (comment)
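To make the `SysProcAttr` and `CLONE_INTO_CGROUP` points in the list above more concrete, here is a minimal Go sketch of what "doing the clone in Go code" might look like. It is not runc's implementation: the helper `namespacedCommand`, the constant `newNamespaces`, the choice of namespaces, and the single-ID uid/gid mappings are all assumptions made for illustration. It assumes Linux and Go 1.20 or newer (for `UseCgroupFD` and `CgroupFD`); actually starting such a command additionally needs cgroup v2, Linux 5.7+, and suitable privileges, so the sketch only builds the command and prints its configuration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// newNamespaces is an illustrative (hypothetical) set of namespaces to create.
// Which namespaces a real container needs depends on its configuration.
const newNamespaces uintptr = syscall.CLONE_NEWUSER | syscall.CLONE_NEWNS | syscall.CLONE_NEWPID

// namespacedCommand is a hypothetical helper. It prepares a command whose
// child would be cloned into new namespaces and placed directly into the
// cgroup referred to by cgroupFD (an open fd on a cgroup v2 directory).
func namespacedCommand(cgroupFD int, path string, args ...string) *exec.Cmd {
	cmd := exec.Command(path, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: newNamespaces,
		// Assumption: map only the current user and group to ID 0 inside
		// the new user namespace. Wider ranges are where newuidmap-style
		// helpers would come in, which this sketch does not cover.
		UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getuid(), Size: 1}},
		GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getgid(), Size: 1}},
		// Ask the Go runtime to use CLONE_INTO_CGROUP with the given fd.
		UseCgroupFD: true,
		CgroupFD:    cgroupFD,
	}
	return cmd
}

func main() {
	// Nothing is started here. The fd value -1 is only a placeholder; a real
	// caller would open the target cgroup directory and pass that fd.
	cmd := namespacedCommand(-1, "/bin/true")
	attr := cmd.SysProcAttr
	fmt.Printf("cloneflags=%#x useCgroupFD=%v cgroupFD=%d\n",
		attr.Cloneflags, attr.UseCgroupFD, attr.CgroupFD)
}
```

Because the parent only sets fields on `SysProcAttr`, any work that must happen outside the container cgroup (such as the `memfd_create()` copy mentioned above) could in principle run in the parent before `cmd.Start()` is called. Joining existing namespaces is not covered by this sketch.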