Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Use runc-dmz to defeat CVE-2019-5736 #3983

Closed
wants to merge 4 commits into from

Conversation

lifubang
Copy link
Member

Close: #3973

For CVE-2019-5736, we have to make sure we can't write the original runc binary from a container.
So we use memfd_create to load runc to the memory and seal it. But the size of runc binary file is about 13M, if we concreate lots of containers in a short period, it will eat host's memory.

We want to reduce the size of runc binary file, but it's very diffculty.
There is another way, before the container's entrypoint start, if we can't see runc, then there is no need to protect runc. So we need to make runc exit before the container task start. The only way is to replace it with an other process, we call it runc-dmz, it means a temp zone between runc and container.

The runc-dmz is very simple:

#include <unistd.h>

extern char **environ;

int main(int argv, char **args) {
    if (argv > 0) {
        return execve(args[0], args, environ);
    }
    return 0;
}

We can make it as a static binary, the size is 852kb.
I think 852kb is more smaller than 14mb, so it will save many memory when concreating lots of container at a time.

But I don't know whether it will cause some other problems.

Makefile Outdated
runc: runc-dmz runc-bin

runc-dmz:
gcc -o runc-dmz -static contrib/cmd/runc-dmz/runc-dmz.c
Copy link
Member

@rata rata Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: $(CC) instead of directly calling gcc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, as it will particularly ease the cross compilation.

@AkihiroSuda
Copy link
Member

CI failing

not ok 165 runc run as user with no exec bit but CAP_DAC_OVERRIDE set
# (in test file tests/integration/start_hello.bats, line 64)
#   `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
#
# runc run test_busybox (status=1):
# time="2023-08-15T14:40:54Z" level=error msg="runc run failed: unable to start container process: exec: \"/bin/echo\": permission denied"
# runc run test_busybox (status=255):
#

runc: runc-dmz runc-bin

runc-dmz:
$(CC) -o runc-dmz -static contrib/cmd/runc-dmz/runc-dmz.c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And CFLAGS

runc: runc-dmz runc-bin

runc-dmz:
$(CC) -o runc-dmz -static contrib/cmd/runc-dmz/runc-dmz.c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to make this a hard dependency, it should be placed out of “contrib”

return execve(args[0], args, environ);
}
return 0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can optimize the binary footprint by eliminating glibc and executing “syscall” instruction directly, but that may happen in follow-up PRs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we build this binary with musl, you get a 13KB binary. With glibc it's 1.1MB. We will need to adjust our build system for this -- but there is a separate issue that distributions might find building a single binary with musl quite difficult. Speaking from the SUSE side, I don't think we even package musl for SLES so we would need to build with glibc (meaning the binary would be 1.1MB in size -- almost 100 times larger).

As you say, this can be done in a future PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewriting this in asm will be much easier than complicating libc dependency

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if there is a silver bullet here:

  • with glibc, you get a big binary.
  • with musl, you get a small binary but this can be painful to build against musl from some distribution.
  • with pure assembly, you will need as many files as supported architecture and choosing which file to build depending on the target architecture.

Maybe using syscall directly can be a middle ground, but you will need to defined the execve syscall number depending on the target architecture.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syscall doesn't reduce the binary size unfortunately (also, libcs provide the syscall numbers for all syscalls in <sys/syscall.h>).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sad :(.
Let me know regarding the assembly files, I can give a hand for amd64 and arm64!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://gist.github.com/Zheaoli/f37bc1fb04917fdfac36d644ee69f7e9
I made a demo for amd64. The binary file size is 4.9k.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my contribution for arm64:

From 95defc03b7ec526c9c966735ebeb5bac5a9a34a0 Mon Sep 17 00:00:00 2001
From: Francis Laniel <[email protected]>
Date: Wed, 30 Aug 2023 13:06:27 +0200
Subject: [PATCH] runc-dmz: Add arm64 assembly code.

Signed-off-by: Francis Laniel <[email protected]>
---
 contrib/cmd/runc-dmz/runc-dmz-arm64.S | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)
 create mode 100644 contrib/cmd/runc-dmz/runc-dmz-arm64.S

diff --git a/contrib/cmd/runc-dmz/runc-dmz-arm64.S b/contrib/cmd/runc-dmz/runc-dmz-arm64.S
new file mode 100644
index 00000000..25387e61
--- /dev/null
+++ b/contrib/cmd/runc-dmz/runc-dmz-arm64.S
@@ -0,0 +1,25 @@
+.equ EXECVE_SYSCALL_NUMBER, 221
+.equ EXIT_SYSCALL_NUMBER, 93
+
+.text
+.global _start
+_start:
+	ldr x3, [sp] // *sp contains argc, so x3 = argc
+	cmp x3, #0
+	bgt execve
+	mov x0, #0
+
+exit:
+	mov x8, EXIT_SYSCALL_NUMBER
+	svc #0
+
+execve:
+	add x1, sp, #16 // x1 = &argv[1]
+	ldr x0, [x1] // x0 = argv[1]
+	# argv[argc] is NULL and environment starts at argv[argc + 1].
+	mov x4, #8 // x4 = 8
+	mul x3, x3, x4 // x3 = (argc + 1) * 8
+	add x2, x1, x3 // x3 = &argv[argc + 1], i.e. environment.
+	mov x8, EXECVE_SYSCALL_NUMBER
+	svc #0
+	b exit
-- 
2.34.1

I tested it both through emulation and virtualization:

francis@pwmachine:~/Codes/kinvolk/runc/contrib/cmd/runc-dmz$ aarch64-linux-gnu-as -g runc-dmz-arm64.S -o runc-dmz-arm64.o                                  (95defc03...) %
francis@pwmachine:~/Codes/kinvolk/runc/contrib/cmd/runc-dmz$ aarch64-linux-gnu-ld -static runc-dmz-arm64.o                                                 (95defc03...) %
francis@pwmachine:~/Codes/kinvolk/runc/contrib/cmd/runc-dmz$ du -sh a.out                                                                                  (95defc03...) %
4,0K    a.out
francis@pwmachine:~/Codes/kinvolk/runc/contrib/cmd/runc-dmz$ file a.out                                                                                    (95defc03...) %
a.out: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, with debug_info, not stripped
francis@pwmachine:~/Codes/kinvolk/runc/contrib/cmd/runc-dmz$ qemu-aarch64-static ./a.out /bin/ls -al                                                       (95defc03...) %
total 32
drwxrwxr-x 2 francis francis 4096 août  30 13:09 .
drwxrwxr-x 7 francis francis 4096 août  29 17:23 ..
-rw-rw-r-- 1 francis francis 1213 août  30 13:06 0001-runc-dmz-Add-arm64-assembly-code.patch
-rwxrwxr-x 1 francis francis 1832 août  30 13:09 a.out
-rw-rw-r-- 1 francis francis 2168 août  30 13:09 runc-dmz-arm64.o
-rw-rw-r-- 1 francis francis  524 août  30 13:06 runc-dmz-arm64.S
-rw-rw-r-- 1 francis francis  168 août  29 17:23 runc-dmz.c
-rw-rw-r-- 1 francis francis  700 août  29 18:59 runc-dmz.s
# Inside arm64 VM:
root@vm-arm64:~/share/kinvolk/runc/contrib/cmd/runc-dmz# lscpu | head -1
Architecture:                       aarch64
root@vm-arm64:~/share/kinvolk/runc/contrib/cmd/runc-dmz# ./a.out /bin/ls -al
total 32
drwxrwxr-x 2 1000 1000 4096 Aug 30 11:09 .
drwxrwxr-x 7 1000 1000 4096 Aug 29 15:23 ..
-rw-rw-r-- 1 1000 1000 1213 Aug 30 11:06 0001-runc-dmz-Add-arm64-assembly-code.patch
-rwxrwxr-x 1 1000 1000 1832 Aug 30 11:09 a.out
-rw-rw-r-- 1 1000 1000  524 Aug 30 11:06 runc-dmz-arm64.S
-rw-rw-r-- 1 1000 1000 2168 Aug 30 11:09 runc-dmz-arm64.o
-rw-rw-r-- 1 1000 1000  168 Aug 29 15:23 runc-dmz.c
-rw-rw-r-- 1 1000 1000  700 Aug 29 16:59 runc-dmz.s

Feel free to apply the patch! In any case, it was funny to write arm64 assembly from scratch as it was a long time I did not do that.
If you see any problem with the code, share your feedback and I will address your comments!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe there still have some design problems for runc-dmz proposal.
Please see: #3987 (comment)
But I can't confirm, if you can check there is no problem for runc-dmz propsal, it will be appreciated. Because I don't want to waste contributor's time for a wrong proposal.

panic(err)
}
nowPath := getExeDir()
dmzMake, err := exec.Command("gcc", "-o", filepath.Join(nowPath, "integration.test-dmz"), "-static", filepath.Join(rootDir, "contrib/cmd/runc-dmz/runc-dmz.c")).CombinedOutput()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn’t be here, and the dmz binary should be just in the PATH?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just embed the dmz binary into the main binary

nowPath := getExeDir()
dmzMake, err := exec.Command("gcc", "-o", filepath.Join(nowPath, "nsenter.test-dmz"), "-static", filepath.Join(rootDir, "contrib/cmd/runc-dmz/runc-dmz.c")).CombinedOutput()
if err != nil {
panic(fmt.Errorf("make runc-dmz error %w (output: %s)", err, dmzMake))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dmz binary should be just preinstalled in the PATH

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just embed the dmz binary into the main binary


extern char **environ;

int main(int argv, char **args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int main(int argv, char **args)
int main(int argc, char **argv)

if (argv > 0) {
return execve(args[0], args, environ);
}
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably this should be non-zero

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, though the binary should be embedded in the runc binary using //go:embed and (possibly in a future PR) all of this code should be moved to Go. I suspect embedding the binary text would make it far easier to do all of this in Go, so maybe the patch in #3953 doing that should be merged into this PR.

However, since the size of runc-dmz under glibc is still 1.1MB, I need to ask whether this actually fixes the original issue -- this would increase the limit of waiting runc creates to 10x the original (~3k). A musl-built binary is much smaller (13KB) and so you can run 1000x more runc creates (but distributions might struggle to build musl binaries this way).

if err != nil {
return err
}
dmzArgs := []string{entryPoint}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this variable isn't necessary, you can just do append([]string{...}, ...).

if err != nil {
return err
}
dmzArgs := []string{entryPoint}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

return execve(args[0], args, environ);
}
return 0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we build this binary with musl, you get a 13KB binary. With glibc it's 1.1MB. We will need to adjust our build system for this -- but there is a separate issue that distributions might find building a single binary with musl quite difficult. Speaking from the SUSE side, I don't think we even package musl for SLES so we would need to build with glibc (meaning the binary would be 1.1MB in size -- almost 100 times larger).

As you say, this can be done in a future PR.

return cloned;

if (fetchve(&argv) < 0)
return -EINVAL;

execfd = clone_binary();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The binary text should be embedded into the runc binary (using //go:embed) and all of this code should be moved to the Go code (like I did in #3953). But if you like, I can do that in a follow-up PR (or I can take this code and the memfd code from #3953 and make a single PR to clean this all up in one go).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or I can take this code and the memfd code from #3953 and make a single PR to clean this all up in one go

Yes, I think you can take this if you think this proposal worth to try.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll open a separate PR with this code and combine it with the cloned-binary patch. Thanks for working on this, I like the idea a lot!

int main(int argv, char **args)
{
if (argv > 0) {
return execve(args[0], args, environ);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering about ensuring the first argument, i.e. the executed binary is runc.
What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first argument is not runc -- it's the container pid1 program.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh OK! Thank you for the precision! I still need to polish my runc knowledge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first argument is not runc -- it's the container pid1 program.

If you try to run the C binary directly, I think the first argument of args would be runc-dmz?

@kolyshkin
Copy link
Contributor

Like the idea! I tried compiling the C prog with dietlibc, ended up with a 9.5K static stripped binary.

I'm not sure how distro vendors (who doesn't have musl or dietlibc) can package this effectively, but we need to find a way to help them all.

@cyphar
Copy link
Member

cyphar commented Aug 18, 2023

I agree with @AkihiroSuda that asm versions for the common architectures is probably the simplest option -- but we can do that in a separate PR to the initial implementation. Most distributions won't be able to package pre-built binaries, so if building with a custom libc is difficult in their build system, they won't be able to do much to reduce the size of the C program.

@lifubang
Copy link
Member Author

not ok 165 runc run as user with no exec bit but CAP_DAC_OVERRIDE set

Maybe this is the big problem of this proposal.

The capabilities of process inherit for execve is very complex. Maybe we should transfer all capabilities data to runc-dmz?

Please see https://man7.org/linux/man-pages/man7/capabilities.7.html

Transformation of capabilities during execve()
       During an [execve(2)](https://man7.org/linux/man-pages/man2/execve.2.html), the kernel calculates the new capabilities
       of the process using the following algorithm:

           P'(ambient)     = (file is privileged) ? 0 : P(ambient)

           P'(permitted)   = (P(inheritable) & F(inheritable)) |
                             (F(permitted) & P(bounding)) | P'(ambient)

           P'(effective)   = F(effective) ? P'(permitted) : P'(ambient)

           P'(inheritable) = P(inheritable)    [i.e., unchanged]

           P'(bounding)    = P(bounding)       [i.e., unchanged]

       where:

           P()    denotes the value of a thread capability set before
                  the [execve(2)](https://man7.org/linux/man-pages/man2/execve.2.html)

           P'()   denotes the value of a thread capability set after the
                  [execve(2)](https://man7.org/linux/man-pages/man2/execve.2.html)

           F()    denotes a file capability set

@cyphar
Copy link
Member

cyphar commented Aug 18, 2023

Yeah, that's unfortunate. But we can't do the capability configuration in runc-dmz because the seccomp profile might block the necessary syscalls, and after execve we no longer have the non-dumpable bit set, which means that container processes would be able to ptrace runc-dmz and thus escalate privileges by injecting code into runc-dmz before it drops its privileges.

Instead, I think the test should have included CAP_DAC_OVERRIDE in the inheritable set. When you do docker run --cap-add CAP_DAC_OVERRIDE the capability gets added to all of the capability sets. WDYT @kolyshkin?

In fact, the runtime-spec having the inheritable set be explicitly configurable is quite strange now that I think about it -- because the runtime has to configure the capabilities and then exec the container process, configuring a strange inheritable configuration would only result in weird runtime behaviour. We cannot configure the inheritable set for the container pid1, which is what I suspect users would expect -- not the current behaviour where dropping a capability in inheritable actually means the pid1 won't get the effective capability at all. Obviously it's probably too late to change any of this behaviour (except maybe making inheritable a no-op?) but we can just change the test.

EDIT: It seems most of the above is wrong -- adding just inheritable doesn't work (as expected given the table @lifubang posted), you need to also add the ambient caps to make the test pass. In addition, looking at the way Docker sets our capabilities, it seems that they also only set the bounding, permitted, and effective capabilities to the requested ones. It's getting a bit late so I might just be tired, but it seems very odd to me that the current behaviour works at all. I'll read through the kernel code next week to figure out what is going on...

EDIT 2: My first analysis is correct -- Docker has broken handling of capabilities for non-root users. Root has special handling in the kernel in handle_privileged_root (not mentioned in man 7 capabilities) which basically means that the permissive set P(inheritable)|P(bset) is used as the permitted and effective sets (ignoring a lot of complexity). For unprivileged users, the "new" ambient set causes these to all be zeroed. You can verify this with today's Docker:

% docker run -u root --cap-drop all --cap-add cap_dac_override alpine cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000002
CapEff: 0000000000000002
CapBnd: 0000000000000002
CapAmb: 0000000000000000
% docker run -u 0 --cap-drop all --cap-add cap_dac_override alpine cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000002
CapAmb: 0000000000000000

This behaviour is identical with runc-dmz, it's just that having the intermediate binary means that we run with the same privileges as the actual container process (that is to say, without the capabilities we expected to pass). Aside from running as root, AFAICS it's

One alternative solution is to set runc-dmz to dynamically have the correct capabilities based on the container configuration but this seems like an incredibly bad idea. I suspect the correct solution is to update Docker's handling of capabilities to correctly handle ambient capabilities for unprivileged users. I can open a bug report against Docker... I'm surprised nobody has root-caused this issue before... here is a forum post where someone worked around the issue without noticing the behaviour from Docker is wrong.

EDIT 3: Or, we could remove the ambient capabilities in runc-dmz. However this will increase the size of runc-dmz (we shouldn't call setcap(2) directly because the API is not supposed to be used outside of libcap...). But, there might be seccomp rules blocking the capability syscalls.

@cyphar
Copy link
Member

cyphar commented Aug 20, 2023

#3987 is a port of this to Go, along with quite a few other things improving the cloning logic.

@lifubang
Copy link
Member Author

According to some security reasons, this proposal is not worth to implement. Thanks everybody’s efforts.

@lifubang lifubang closed this Aug 31, 2023
@lifubang
Copy link
Member Author

According to some security reasons, this proposal is not worth to implement. Thanks everybody’s efforts.

To anyone who reads in here, this proplsal finally has taken in some conditions, please see #3987 (comment) . Thanks everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider revert #3931 nsexec: cloned_binary: remove bindfd logic entirely
7 participants