Default “defaultErrnoRet” breaks the ability for hosts to run more recent container images. #1266
Comments
tl;dr: It would be nice to fix this in the spec; however, solving this is more complicated than you might expect, and libseccomp is missing necessary features (not to mention that this should be a problem solved by libseccomp itself, IMHO). The simplest solution to the immediate problem is to fix Docker's profile so that it uses We discussed switching to The roadblock we ran into is that libseccomp didn't provide enough facilities to make this work seamlessly. seccomp/libseccomp#11 and seccomp/libseccomp#286 are the upstream issues where this topic was discussed. In short, there are a few key issues (these are also outlined in the runc commit that added the patchbpf code):
When working on the "hacky" runc code to work around these problems, I tried many approaches and rewrote the code several times. I tried to implement a minimum-kernel-version setup several times (including coming up with very complicated mechanisms to generate inverse rules) and came to the conclusion that it is not possible to implement this with libseccomp currently, and that this needs to be implemented by libseccomp itself. I also tried to come up with a different patching system that just patched The best solution we have at the moment is for runtimes to use Regarding the ARM and PPC issues you mentioned -- we did fix a bug related to ppc64le recently that might also have affected ARM. Please verify whether that patch fixes the issue you mentioned, and if not, please submit a bug report to runc directly. While I don't like that code, having a maximum kernel version would require very similar code unless we add support for all of this to libseccomp directly. The reason we implemented it in runc without pushing it to the runtime-spec is that the libseccomp folks said they were working on it, so we opted to wait for libseccomp to be ready before we define the right behaviour in the spec.
In theory you can already get this information from BTF if you really want to. However, you don't actually need the current set of syscalls on the running kernel to solve this problem; you need historical data so you can set a minimum kernel version. What we need is for libseccomp to have the ability to specify a minimum kernel version which will cause libseccomp to replace its final generic
Currently the spec defines the default `defaultErrnoRet` to be `EPERM`. This is troublesome and causes issues when running newer containers on hosts with an older kernel or user space, where the libseccomp version might not be aware of some of the syscalls used by the container. Many issues have been reported in which an unavailable syscall gets an `EPERM` return code instead of `ENOSYS`, breaking the user space inside the container. Below are examples of such reports:
- [garden/seccomp] Unable to run 32-bit binaries in concourse containers concourse/concourse#7471
- Incorrect default errnoRet ? #1122
- seccomp filter should return ENOSYS for unknown syscalls runc#2151
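
To illustrate why the two error codes are not interchangeable, here is a minimal Python sketch of the fallback pattern that user-space code commonly follows when probing for a newer syscall. Everything in it is a stand-in: `call_with_fallback`, `blocked_with`, and `legacy_call` are hypothetical helpers, and the "syscalls" are plain Python callables that simulate what a seccomp filter would return. It is not taken from any particular libc or runtime.

```python
import errno


def call_with_fallback(new_call, old_call):
    """Try a newer syscall wrapper, falling back only on ENOSYS."""
    try:
        return new_call()
    except OSError as exc:
        if exc.errno == errno.ENOSYS:
            # "Not implemented": safe to fall back to the older interface.
            return old_call()
        # Anything else (including EPERM) is treated as a real failure.
        raise


def blocked_with(code):
    """Build a stand-in for a syscall that a filter rejects with `code`."""
    def _call():
        raise OSError(code, "simulated seccomp rejection")
    return _call


def legacy_call():
    """Stand-in for the older interface that the host does support."""
    return "legacy path"


# Filter answers ENOSYS: the caller falls back transparently.
print(call_with_fallback(blocked_with(errno.ENOSYS), legacy_call))

# Filter answers EPERM: the caller gives up even though the fallback exists.
try:
    call_with_fallback(blocked_with(errno.EPERM), legacy_call)
except PermissionError as exc:
    print("fatal:", exc)
```

With `ENOSYS` the caller quietly uses the older interface; with `EPERM` it assumes the operation is forbidden and never tries the fallback, which is the kind of breakage described in the reports above.
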
runc does have some hacky code that tries to figure out whether a syscall is supported, but the method is not always reliable: at Canonical we have seen reports of Ubuntu Noble containers breaking on Ubuntu Jammy hosts on ARM as well as PPC.
This issue currently affects all fixed-release distros, which also happen to be the most popular distros for running containers. For these distros, updating libseccomp for every new syscall introduces an unnecessary risk of regression and defeats the whole purpose of a fixed-release distro.
I understand that changing the current `defaultErrnoRet` to `ENOSYS` may also cause regressions. However, it also needs to be acknowledged that the `EPERM` default value was an oversight and that the OCI spec is fundamentally not compatible with fixed-release distros, which are the most popular distros for running containers. It also violates one of the most fundamental expectations of containers: being able to run any version of a user space, whether it is older or newer than the version on the host.
Having said that, I believe there is a way to satisfy both camps. A list of the currently available syscalls (up to kernel 6.10 as of the writing of this post) can be compiled and manually set to `EPERM` in the seccomp profiles for those who were relying on the default `EPERM` return value, while leaving the others unchanged. This means that when `defaultErrnoRet` is changed to `ENOSYS`, all previously available syscalls will still return `EPERM`, while newly added ones, or even older ones that are defined in the seccomp profile but not known to the libseccomp package on the host, will return the correct `ENOSYS`.
As an example, this can be expressed as the following in the runtime spec:
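
For the profile side of this idea, here is a minimal sketch of what such a seccomp profile could look like, built in Python purely for illustration. The field names (`defaultAction`, `defaultErrnoRet`, `syscalls`, `names`, `action`, `errnoRet`) follow the runtime spec's seccomp section. The two syscall lists are hypothetical and heavily shortened, `build_profile` is an invented helper, and the errno numbers are taken from the machine running the script, so they can differ between architectures.

```python
import errno
import json

# Hypothetical, heavily shortened lists. A real profile would enumerate every
# syscall that exists up to the chosen cutoff kernel.
ALLOWED = ["read", "write", "openat", "close"]
KNOWN_BUT_DENIED = ["mount", "reboot", "kexec_load"]


def build_profile(allowed, known_but_denied):
    """Return a runtime-spec style seccomp object with an ENOSYS default."""
    return {
        # Anything not listed below falls through to ENOSYS.
        "defaultAction": "SCMP_ACT_ERRNO",
        "defaultErrnoRet": errno.ENOSYS,
        "syscalls": [
            {"names": sorted(allowed), "action": "SCMP_ACT_ALLOW"},
            {
                # Known syscalls keep the EPERM result that existing
                # users may rely on.
                "names": sorted(known_but_denied),
                "action": "SCMP_ACT_ERRNO",
                "errnoRet": errno.EPERM,
            },
        ],
    }


profile = build_profile(ALLOWED, KNOWN_BUT_DENIED)
print(json.dumps(profile, indent=2))
```

The design choice here is that the explicit `EPERM` list carries the historical knowledge, so the default can be `ENOSYS` without changing the result for any syscall that existed when the list was compiled.
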
`defaultErrnoRet` (uint, OPTIONAL) - the errno return code to use.
Some actions like `SCMP_ACT_ERRNO` and `SCMP_ACT_TRACE` allow specifying the errno code to return.
When the action doesn't support an errno, the runtime MUST print an error and fail.
If not specified, its default value is `EPERM` for syscalls that existed before kernel 6.11 and `ENOSYS` for later ones.
This means that anyone currently using the spec will see no change to their containers, since they are all using syscalls from Linux 6.10 and below. It also means that newer containers using post-6.10 syscalls will get the expected `ENOSYS` error, limiting the issue.
Also, the spec should define the behavior to follow if the syscall name is not known to the host. I believe the spec should explicitly define `ENOSYS` for such syscalls. I am also planning to work on a kernel driver that would expose to user space the list of syscalls supported by the kernel, making it easier to determine the return value of each syscall.
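
The proposed default could be resolved along the following lines. This is a sketch under stated assumptions, not how any runtime or libseccomp does it today: `INTRODUCED_IN` is an assumed table of historical data with only a handful of entries, `hypothetical_new_call` is an invented name standing in for a post-6.10 syscall, and `default_errno` is a hypothetical helper.

```python
import errno

# Assumed historical data: syscall name -> kernel release that introduced it.
# Only a few illustrative entries; a real table would cover every syscall.
INTRODUCED_IN = {
    "openat": (2, 6),
    "clone3": (5, 3),
    "openat2": (5, 6),
    "faccessat2": (5, 8),
    "mseal": (6, 10),
    "hypothetical_new_call": (6, 12),  # invented entry, not a real syscall
}

# Last release whose syscalls keep the legacy EPERM default.
CUTOFF = (6, 10)


def default_errno(name, introduced_in=INTRODUCED_IN, cutoff=CUTOFF):
    """Pick the default errno for a syscall that has no explicit rule."""
    version = introduced_in.get(name)
    if version is None:
        # Name not known at all: report "not implemented".
        return errno.ENOSYS
    if version <= cutoff:
        # Existed at the cutoff: keep the historical EPERM behaviour.
        return errno.EPERM
    # Added after the cutoff: let user space fall back.
    return errno.ENOSYS


for name in ("clone3", "mseal", "hypothetical_new_call", "not_in_table"):
    print(name, errno.errorcode[default_errno(name)])
```

Versions are compared as `(major, minor)` tuples, so the cutoff can be moved without touching the lookup logic.
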