Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse the same rootfs for the inner and outer container #1048

Closed
apyrgio opened this issue Jan 9, 2025 · 5 comments · Fixed by #1049
Closed

Reuse the same rootfs for the inner and outer container #1048

apyrgio opened this issue Jan 9, 2025 · 5 comments · Fixed by #1049
Assignees
Labels
container enhancement New feature or request
Milestone

Comments

@apyrgio
Copy link
Contributor

apyrgio commented Jan 9, 2025

When building the Dangerzone container image, we use multi-stage builds, where we create the inner container first, and then copy it to the outer container, under /home/dangerzone/dangerzone-image/rootfs:

COPY --from=dangerzone-image / /home/dangerzone/dangerzone-image/rootfs

This was the most clean way to use nested containers, and was implemented during #590. Now that we plan to use Debian Stable for the base image of Dangerzone (see #1046), it would be nice to merge these two container images into one, to avoid the double overhead of the Debian container image.

I have experimented with a way we can achieve this, and the following seem to work:

  1. Remove the /home/dangerzone/dangerzone-image/ dir, and write instead the OCI config to /config.json, i.e., at the root of the container image.
  2. Change the OCI config to use the container root (/) as the image bundle, instead of /home/dangerzone/dangerzone-image/rootfs.

These changes are reflected in the following diff:

--- a/dangerzone/gvisor_wrapper/entrypoint.py
+++ b/dangerzone/gvisor_wrapper/entrypoint.py
@@ -56,7 +56,7 @@ oci_config: dict[str, typing.Any] = {
             {"type": "RLIMIT_NOFILE", "hard": 4096, "soft": 4096},
         ],
     },
-    "root": {"path": "rootfs", "readonly": True},
+    "root": {"path": "/", "readonly": True},
     "hostname": "dangerzone",
     "mounts": [
         {
@@ -133,7 +142,7 @@ if os.environ.get("RUNSC_DEBUG"):
     json.dump(oci_config, sys.stderr, indent=2, sort_keys=True)
     # json.dump doesn't print a trailing newline, so print one here:
     log("")
-with open("/home/dangerzone/dangerzone-image/config.json", "w") as oci_config_out:
+with open("/config.json", "w") as oci_config_out:
     json.dump(oci_config, oci_config_out, indent=2, sort_keys=True)
 
 # Run gVisor.
@@ -150,7 +159,7 @@ if os.environ.get("RUNSC_DEBUG"):
     runsc_argv += ["--debug=true", "--alsologtostderr=true"]
 if os.environ.get("RUNSC_FLAGS"):
     runsc_argv += [x for x in shlex.split(os.environ.get("RUNSC_FLAGS", "")) if x]
-runsc_argv += ["run", "--bundle=/home/dangerzone/dangerzone-image", "dangerzone"]
+runsc_argv += ["run", "--bundle=/", "dangerzone"]
 log(
     "Running gVisor with command line: {}", " ".join(shlex.quote(s) for s in runsc_argv)
 )

My question is, do we risk something security-wise by doing this? Note that we instruct GVisor to treat the root filesystem as read-only (not that GVisor would write to these files in any case). Also, we mask sensitive system dirs (/proc, /tmp, /dev, sys, /home/dangerzone). There are some extra files that are mounted to the outer container on /etc/ and /run that we can mask, although they don't seem to give any sensitive info:

$ mount
[...]
tmpfs on /run/.containerenv type tmpfs (rw,nosuid,nodev,relatime,size=3258360k,nr_inodes=814590,mode=700,uid=1000,gid=958,inode64)
tmpfs on /etc/hostname type tmpfs (rw,nosuid,nodev,relatime,size=3258360k,nr_inodes=814590,mode=700,uid=1000,gid=958,inode64)
tmpfs on /etc/hosts type tmpfs (rw,nosuid,nodev,relatime,size=3258360k,nr_inodes=814590,mode=700,uid=1000,gid=958,inode64)
[...]
$ cat /etc/hostname
cec0ac615cbe
$ cat /etc/hosts
127.0.0.1       localhost
::1     localhost
192.168.1.23    host.containers.internal host.docker.internal
127.0.0.1       cec0ac615cbe dangerzone-doc-to-pixels-KwY1lA
$ cat /run/.containerenv

Overall, I'm inclined to go forward with this, but I'd like a second set of 👀 to validate my assumptions.

CCing @EtiennePerot for obvious reasons.

@apyrgio apyrgio added the enhancement New feature or request label Jan 9, 2025
@apyrgio apyrgio self-assigned this Jan 9, 2025
@apyrgio apyrgio added this to the 0.9.0 milestone Jan 9, 2025
@apyrgio apyrgio moved this from Todo to In Progress in Dangerzone ✨ Jan 9, 2025
apyrgio added a commit that referenced this issue Jan 14, 2025
Remove the need to copy the Dangerzone container image (used by the
inner container) within a wrapper gVisor image (used by the outer
container). Instead, use the root of the container filesystem for both
containers. We can do this safely because we don't mount any secrets to
the container, and because gVisor offers a read-only view of the
underlying filesystem

Fixes #1048
@apyrgio apyrgio moved this from In Progress to PR Review in Dangerzone ✨ Jan 14, 2025
@EtiennePerot
Copy link
Contributor

EtiennePerot commented Jan 17, 2025

There's no reason why this wouldn't work in general. The risks are those that you highlighted: the sandbox exposes more files than it previously did, but if you trust gVisor to do its job then that should not be an issue.

There is one additional important risk from exposure of the Unix Domain Socket that is used to control a running sandbox. gVisor stores per-sandbox UDSs inside the OCI root directory (passed to as runsc --root=/some/path, defaulting to $XDG_RUNTIME_DIR/runsc or fallback to /var/run/runsc if $XDG_RUNTIME_DIR is unset in the environment). gVisor sandboxes run a server on this UDS in order to respond to requests from the OCI runtime (this is how commands like podman exec work). So write access to this UDS allows an equivalent amount of control over that sandbox as the host would have. This means the ability to start new containers inside the same sandbox, stop/checkpoint the sandbox, etc. The most harmful things I can think of using that interface are either a host DoS, or the ability to write to arbitrary files in the out-of-sandbox filesystem (by triggering a checkpoint request), though the contents being written would be only semi-attacker-controller. In any case, it's in general not meant to be a security-hardened interface, and a sandbox certainly has no business having access to its own control mechanism.

To prevent this, gVisor has a --host-uds flag which defaults to none, which makes gVisor refuse to forward connection attempts from the sandbox to host UDSs. Given that Dangerzone doesn't set this flag, it currently has the --host-uds=none behavior, so it should already be safe. But I would definitely recommend explicitly setting this flag because now it would become load-bearing. To add another layer of access prevention to this UDS, I would also recommend masking the OCI root directory with a tmpfs mount, and explicitly setting runsc's --root flag for the same reason (it would now become load-bearing, and the default value is environment-variable-dependent).

In my opinion, sticking with the two-stage structure would be safer just on the basis of making fewer security assumptions. It may be possible to avoid the double-overhead of the Debian image by sharing the large-but-static subtrees of it. Deleting /usr/{bin,lib,sbin,share} and creating symlinks from /usr/{bin,lib,sbin,share} -> /dangerzone-image/usr/{bin,lib,sbin,share} might be all it takes (this covers 121 MiB out of the 127 MiB of the debian image on my machine), and it should work if the two images are of the same Debian base image. The runsc binary (currently at /usr/bin/runsc) can be moved to somewhere else such as at /runsc. It is a static binary so it has no .so dependencies.

@apyrgio
Copy link
Contributor Author

apyrgio commented Jan 17, 2025

Truth is, I originally tried the symlink approach, but I did it the other way around, i.e., /dangerzone-image/usr/{bin,lib,sbin,share} -> /usr/{bin,lib,sbin,share} and I encountered some errors (don't have them handy I'm afraid). I may give it a shot again and report here.

Also, good to know about the UDS and other potentially sensitive files that gVisor may create. I had a concern like that in mind, and turns out we already use runsc ... --root /home/dangerzone/.containers, and /home/dangerzone is masked, so hopefully all gVisor-related files will be masked:

runsc_argv = [
"/usr/bin/runsc",
"--rootless=true",
"--network=none",
"--root=/home/dangerzone/.containers",

{
"destination": "/home/dangerzone",
"type": "tmpfs",
"source": "tmpfs",
"options": ["nosuid", "noexec", "nodev"],
},

What bothers me is that if an assumption of ours change, e.g., Docker decides to mount a sensitive file somewhere, then we may not detect it. I'll think more about this and report back.

@EtiennePerot
Copy link
Contributor

EtiennePerot commented Jan 18, 2025

Truth is, I originally tried the symlink approach, but I did it the other way around, i.e., /dangerzone-image/usr/{bin,lib,sbin,share} -> /usr/{bin,lib,sbin,share} and I encountered some errors (don't have them handy I'm afraid). I may give it a shot again and report here.

Right, it won't work in that direction because they would appear as symlinks that lead nowhere when trying to resolve them with the view of the filesystem that the inner container has. gVisor doesn't resolve these symlinks on the host when the sandbox attempts to read them (nor could it anyway, because it isolates its own view of the out-of-sandbox filesystem such that it can no longer access files outside of the rootfs directory and explicitly-configured mountpoints).

we already use runsc ... --root /home/dangerzone/.containers, and /home/dangerzone is masked, so hopefully all gVisor-related files will be masked

Should work.

apyrgio added a commit that referenced this issue Jan 21, 2025
Remove the need to copy the Dangerzone container image (used by the
inner container) within a wrapper gVisor image (used by the outer
container). Instead, use the root of the container filesystem for both
containers. We can do this safely because we don't mount any secrets to
the container, and because gVisor offers a read-only view of the
underlying filesystem

Fixes #1048
apyrgio added a commit that referenced this issue Jan 21, 2025
Remove the need to copy the Dangerzone container image (used by the
inner container) within a wrapper gVisor image (used by the outer
container). Instead, use the root of the container filesystem for both
containers. We can do this safely because we don't mount any secrets to
the container, and because gVisor offers a read-only view of the
underlying filesystem

Fixes #1048
apyrgio added a commit that referenced this issue Jan 21, 2025
Mask some paths of the outer container in the OCI config of the inner
container. This is done to avoid leaking any sensitive information from
Podman / Docker / gVisor, since we reuse the same rootfs

Refs #1048
apyrgio added a commit that referenced this issue Jan 21, 2025
Mask some paths of the outer container in the OCI config of the inner
container. This is done to avoid leaking any sensitive information from
Podman / Docker / gVisor, since we reuse the same rootfs

Refs #1048
apyrgio added a commit that referenced this issue Jan 21, 2025
Mask some paths of the outer container in the OCI config of the inner
container. This is done to avoid leaking any sensitive information from
Podman / Docker / gVisor, since we reuse the same rootfs

Refs #1048
@apyrgio
Copy link
Contributor Author

apyrgio commented Jan 21, 2025

Etienne's assumption that only /usr is required for Dangerzone to run inside gVisor is pretty much spot on. Based on my tests, I'd add /etc/ to that, for LibreOffice settings and font config.

Since it's just two dirs, and we can mask all the rest, well, I did just that. I went ahead and created a mount entry in our OCI config file for every root dir that is listed in the Filesystem Hierarchy Standard (FHS), except for:

  • /usr
  • /lib (due to /usr/lib -> /lib),
  • /bin (due to /usr/bin -> /bin), and
  • /etc

See 005d742 (unmerged commit, hash may change) for details.

I think I prefer that approach over the symlinks one for simplicity reasons. I'm aware it relies on gVisor properly implementing masking, and respecting container boundaries, but that's something we just have to rely on at the end of the day (and a separate container probably wouldn't help there).

@apyrgio
Copy link
Contributor Author

apyrgio commented Jan 23, 2025

I also gave a shot at the symlinks route. It may be that my Linux fu is rusty, but I found that symlinking /usr to another place in a Dockerfile is hard. It's simpler to do so in a rescue / live USB system, but podman build is another beast.

Anyways, check out 414efb6. Defense in depth -wise, it's better to go with the symlinks approach, but I can't attest to the maintainability of this option. Note that the way I've managed to make it work does not add any size overhead, so both approaches create a small-ish image tarball (~445 MiB)

apyrgio added a commit that referenced this issue Jan 23, 2025
Mask some paths of the outer container in the OCI config of the inner
container. This is done to avoid leaking any sensitive information from
Podman / Docker / gVisor, since we reuse the same rootfs

Refs #1048
apyrgio added a commit that referenced this issue Jan 27, 2025
Mask some paths of the outer container in the OCI config of the inner
container. This is done to avoid leaking any sensitive information from
Podman / Docker / gVisor, since we reuse the same rootfs

Refs #1048
apyrgio added a commit that referenced this issue Jan 27, 2025
Update our Dockerfile and entrypoint script in order to reuse the /usr
dir in the inner and outer container image.

Refs #1048
@github-project-automation github-project-automation bot moved this from PR Review to Done in Dangerzone ✨ Jan 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
container enhancement New feature or request
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants