Import and compute composefs' fs-verity in a multi-staged build #34

Closed
travier opened this issue Nov 13, 2024 · 14 comments · Fixed by #36
Comments

@travier
Member

travier commented Nov 13, 2024

In order to be able to run the entire build process from a multi-staged build via a single Containerfile build, we need cfsctl to read the OCI content using a --mount/--from option like in https://gitlab.com/fedora/bootc/base-images/-/blob/main/Containerfile.

cfsctl would read the entire container archive, generate the composefs dumpfile, pass it to mkcomposefs and finally get the fs-verity of the resulting EROFS.

Then this layer could build the UKI using this fs-verity-hash.

As we need this hash as a label (i.e. outside of the container), we would then store it in a file or print it to stdout at the end, and another step in the build process would add the hash as a label (as it's not possible to add labels with values generated from stages right now, AFAIK).

@allisonkarlitskaya
Collaborator

containers/buildah#5837 would be a huge help here, but we could also do it by scanning the filesystem in a normal bind mount, and hopefully we can get the same result.

@allisonkarlitskaya
Collaborator

I have a working branch for this. The two composefs images are equal, with one major issue: when we create a composefs from the image via layer tarballs, we use the epoch as the mtime for / because / never explicitly appears in a layer. The mtime of the root of a container image, as inspected inside of podman, appears to be equal to the time at which podman unpacked that layer (i.e. at podman build or podman pull time: nothing deterministic).

Using the epoch feels slightly terrible but it's better than something essentially chosen at random.

@allisonkarlitskaya
Collaborator

allisonkarlitskaya commented Nov 14, 2024

Another option would be to specify that we use the Created time in the container config. It's not clear how we'd get access to that from inside of the container, though.

@allisonkarlitskaya
Collaborator

allisonkarlitskaya commented Nov 14, 2024

Or here are three variations on the same theme:

  • the mtime of the root directory is set to be the same as the most recent mtime of the files present in the root directory; or
  • the mtime of the root directory is set to be the same as the most recent mtime in the entire filesystem; or
  • each directory has its mtime set to the most recent mtime of any of its entries. This would apply transitively (so it generalizes the first option and ends up also satisfying the second in the process)

The last option has a certain appeal but it's also the most "destructive" of all the options and it has an additional problem: I'm not sure how well it would play with things like the various icon/mime/etc cache files which are supposed to be newer than the directories that contain them. We'd have to read those specs carefully, and hope that there's not some other spec that we didn't consider which we accidentally break.

On balance I like option 2. "Most recent mtime in the entire filesystem" probably ends up being a pretty good proxy for "container creation timestamp".
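
To make the difference between the three variations concrete, here is a minimal sketch on a toy tree. The `Node` type and the function names are hypothetical and only for illustration; they are not the project's actual data structures.

```rust
/// Hypothetical in-memory tree node, for illustration only.
#[derive(Debug)]
enum Node {
    File { mtime: i64 },
    Dir { mtime: i64, entries: Vec<Node> },
}

impl Node {
    fn mtime(&self) -> i64 {
        match self {
            Node::File { mtime } | Node::Dir { mtime, .. } => *mtime,
        }
    }
}

/// Variation 1: newest mtime among the direct entries of the root.
fn newest_direct(entries: &[Node]) -> Option<i64> {
    entries.iter().map(Node::mtime).max()
}

/// Variation 2: newest mtime anywhere below the root.
fn newest_overall(entries: &[Node]) -> Option<i64> {
    entries
        .iter()
        .map(|node| match node {
            Node::File { mtime } => *mtime,
            Node::Dir { mtime, entries } => {
                newest_overall(entries).map_or(*mtime, |m| m.max(*mtime))
            }
        })
        .max()
}

/// Variation 3: rewrite every directory's mtime to the newest mtime of its
/// entries, transitively. Returns the resulting mtime of `node`.
/// An empty directory keeps its own mtime.
fn propagate(node: &mut Node) -> i64 {
    match node {
        Node::File { mtime } => *mtime,
        Node::Dir { mtime, entries } => {
            if let Some(newest) = entries.iter_mut().map(propagate).max() {
                *mtime = newest;
            }
            *mtime
        }
    }
}
```

Note that only the third variation modifies anything other than the root, which is why it is the most "destructive" of the three.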

@cgwalters
Collaborator

Hmm shouldn't it just be the mtime of the tar entry that contained the root?

@allisonkarlitskaya
Collaborator

> Hmm shouldn't it just be the mtime of the tar entry that contained the root?

That's exactly the problem. There isn't one.

@allisonkarlitskaya
Collaborator

There's actually an additional issue here: we might reasonably assume that the owner of the root directory is root:root but it's not clear what the permissions should be. I've hardcoded them to 0755 based on gut instinct, and this is what they are on my (ostree) booted system and /sysroot, but podman actually picks 0555 for this purpose, so that's one more difference we have to figure out how to paper over.

We definitely need solid specs around these things. They impact the determinism of the process of producing the image.

@travier
Member Author

travier commented Nov 15, 2024

> the mtime of the root directory is set to be the same as the most recent mtime of the files present in the root directory

I think that's the most reasonable one. It should be fast to find out. Another option is to set it to the oldest mtime for the files present in the root directory.

Note that we don't want to change any mtime of other files. This has implications for Python bytecode for example: ostreedev/ostree#1469

@travier
Member Author

travier commented Nov 15, 2024

> I've hardcoded them to 0755 based on gut instinct, and this is what they are on my (ostree) booted system and /sysroot, but podman actually picks 0555 for this purpose, so that's one more difference we have to figure out how to paper over.

555 is nice as it reflects that even root cannot change those files. If podman uses it then I guess it should be fine? 755 should be just fine as well.

@allisonkarlitskaya
Collaborator

allisonkarlitskaya commented Nov 15, 2024

> I think that's the most reasonable one. It should be fast to find out. Another option is to set it to the oldest mtime for the files present in the root directory.

I went with "newest file overall". It's also easy to find out. Note that every time we're importing a filesystem we're scanning the entire thing. Keeping this one extra tiny bit of state around ends up not being difficult at all.

> 555 is nice as it reflects that even root cannot change those files. If podman uses it then I guess it should be fine? 755 should be just fine as well.

I feel like special-casing as little as possible, and I also don't feel like getting into a fight with podman, so 0555 it is.

@cgwalters
Collaborator

> That's exactly the problem. There isn't one.

There is in the fedora-bootc image:

tar tvf fedora-oci/blobs/sha256/2dbcdb8a850237b5ccdfeadb58518e23fc4d0e969d475835829935d5010d080d | head
drwxr-xr-x 0/0               0 1969-12-31 19:00 ./
drwxr-xr-x 0/0               0 1969-12-31 19:00 sysroot
drwxr-xr-x 0/0               0 1969-12-31 19:00 sysroot/ostree

But yeah, there isn't one in quay.io/fedora/fedora or docker.io/library/debian. I think we should encourage this though, especially for bootable containers, because another question that arises if we just make up an entry for / is what its SELinux label should be (for sealed images we shouldn't be computing it client side).

So let's honor the tar stream if it exists, otherwise fall back.
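
A rough sketch of that rule, under stated assumptions: `Stat`, `Entry`, `is_root` and `root_stat` are hypothetical names, the set of path spellings treated as the root is a guess, and the fallback values (root:root, mode 0555, plus an mtime supplied by the caller) simply mirror what was discussed earlier in this thread. This is not the project's actual code.

```rust
/// Metadata for a directory entry. The field set is illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Stat {
    mode: u32,
    uid: u32,
    gid: u32,
    mtime: i64,
}

/// Hypothetical, simplified view of a tar entry: a path plus its metadata.
struct Entry<'a> {
    path: &'a str,
    stat: Stat,
}

/// Assumption: the root may be spelled "./" (as in the listing above),
/// "." or "/".
fn is_root(path: &str) -> bool {
    matches!(path, "." | "./" | "/")
}

/// Use the root entry from the tar stream if one exists, otherwise fall
/// back to made-up metadata. `entries` holds all entries of all layers in
/// order, so the last root entry wins.
fn root_stat(entries: &[Entry<'_>], fallback_mtime: i64) -> Stat {
    entries
        .iter()
        .rev()
        .find(|entry| is_root(entry.path))
        .map(|entry| entry.stat.clone())
        .unwrap_or(Stat {
            mode: 0o555, // what podman picks for a missing root entry
            uid: 0,
            gid: 0,
            mtime: fallback_mtime,
        })
}
```

Passing the fallback mtime in as a parameter keeps the choice of policy (epoch, newest file, and so on) separate from the honor-or-fall-back decision.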

@allisonkarlitskaya
Collaborator

> So let's honor the tar stream if it exists, otherwise fall back.

Agreed completely. I'm about to push a branch, and this is exactly what I wrote in the documentation that I added.

@cgwalters
Collaborator

(Whether / is present or not is also something that would be important for "canonical tar")

@allisonkarlitskaya
Collaborator

> I went with "newest file overall". It's also easy to find out. Note that every time we're importing a filesystem we're scanning the entire thing. Keeping this one extra tiny bit of state around ends up not being difficult at all.

This turned out to be naive. Simply taking the newest mtime across all tar entries is definitely wrong because it doesn't take whiteouts into account (both the whiteout itself, plus the file that it deletes).
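
A small sketch of why the naive approach goes wrong, assuming OCI-style `.wh.<name>` whiteout entries. The `Entry` type and both functions are hypothetical, paths are plain relative strings, and opaque whiteouts (`.wh..wh..opq`) are deliberately not handled.

```rust
use std::collections::BTreeMap;

/// Hypothetical simplified layer entry: a path and an mtime.
struct Entry {
    path: &'static str,
    mtime: i64,
}

/// Naive: newest mtime over every tar entry in every layer, including
/// whiteout markers and files that a later layer deletes.
fn naive_newest(layers: &[Vec<Entry>]) -> Option<i64> {
    layers.iter().flatten().map(|entry| entry.mtime).max()
}

/// Apply the layers in order, honoring plain whiteouts, then take the
/// newest mtime of whatever survives in the merged tree.
fn merged_newest(layers: &[Vec<Entry>]) -> Option<i64> {
    let mut tree: BTreeMap<String, i64> = BTreeMap::new();
    for entry in layers.iter().flatten() {
        let (dir, name) = match entry.path.rsplit_once('/') {
            Some((dir, name)) => (dir, name),
            None => ("", entry.path),
        };
        if let Some(target) = name.strip_prefix(".wh.") {
            // A whiteout removes the named path and everything below it.
            // The marker itself never enters the merged tree.
            let victim = if dir.is_empty() {
                target.to_string()
            } else {
                format!("{}/{}", dir, target)
            };
            let prefix = format!("{}/", victim);
            tree.retain(|path, _| *path != victim && !path.starts_with(&prefix));
        } else {
            tree.insert(entry.path.to_string(), entry.mtime);
        }
    }
    tree.values().copied().max()
}
```

With a first layer containing `etc/old.conf` (mtime 500) and a second layer containing `etc/.wh.old.conf` (mtime 900) next to an ordinary file with mtime 200, the naive function reports 900 while the merged tree's newest mtime is 200.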

allisonkarlitskaya added a commit that referenced this issue Nov 15, 2024
src/fs.rs contains code for writing the in-memory filesystem tree to a
directory on disk, so let's add the other direction: converting an
on-disk directory to an in-memory filesystem tree.  This will let us
scan container images from inside containers.  This is necessary because
we can't get access to the OCI layer tarballs during a container build
(even from a later stage in a multi-stage build) but we can bindmount
the root filesystem.

See containers/buildah#5837

With our recent changes to how we handle metadata on the root directory
we should now be producing the same image on the inside and the outside,
which gives us a nice way to produce a UKI with a built-in `composefs=`
command-line parameter.

Add a new 'unified' example.  This does the container build as a single
`podman build` command with no special arguments.

Closes #34
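
The directory-to-tree direction described in this commit message could look roughly like the following. This is a minimal, Unix-only sketch using just the standard library; `Node` and `scan` are hypothetical names, and ownership, xattrs, file contents and non-UTF-8 names are ignored, so it is not the code in src/fs.rs.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Hypothetical in-memory node; the project's real types may differ.
#[derive(Debug)]
enum Node {
    Dir { mode: u32, mtime: i64, entries: BTreeMap<String, Node> },
    Leaf { mode: u32, mtime: i64, size: u64 },
}

/// Recursively convert an on-disk directory (for example a bind-mounted
/// root filesystem) into an in-memory tree.
fn scan(path: &Path) -> io::Result<Node> {
    // symlink_metadata() does not follow symlinks, so a symlink is
    // recorded as a leaf instead of being traversed.
    let meta = fs::symlink_metadata(path)?;
    if meta.is_dir() {
        // A BTreeMap keeps entries sorted by name, so the result does not
        // depend on the order in which the kernel returns directory entries.
        let mut entries = BTreeMap::new();
        for entry in fs::read_dir(path)? {
            let entry = entry?;
            let name = entry.file_name().to_string_lossy().into_owned();
            entries.insert(name, scan(&entry.path())?);
        }
        Ok(Node::Dir { mode: meta.mode(), mtime: meta.mtime(), entries })
    } else {
        Ok(Node::Leaf { mode: meta.mode(), mtime: meta.mtime(), size: meta.size() })
    }
}
```

The mtime and mode that such a scan reads for the top-level directory come from whatever unpacked the image, which is why the root metadata needs the special handling discussed above.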
allisonkarlitskaya added a commit that referenced this issue Nov 15, 2024
src/fs.rs contains code for writing the in-memory filesystem tree to a
directory on disk, so let's add the other direction: converting an
on-disk directory to an in-memory filesystem tree.  This will let us
scan container images from inside containers.  This is necessary because
we can't get access to the OCI layer tarballs during a container build
(even from a later stage in a multi-stage build) but we can bindmount
the root filesystem.

See containers/buildah#5837

With our recent changes to how we handle metadata on the root directory
we should now be producing the same image on the inside and the outside,
which gives us a nice way to produce a UKI with a built-in `composefs=`
command-line parameter.

Add a new 'unified' example.  This does the container build as a single
`podman build` command with no special arguments.

Closes #34

Signed-off-by: Allison Karlitskaya <[email protected]>
allisonkarlitskaya added a commit that referenced this issue Nov 20, 2024
src/fs.rs contains code for writing the in-memory filesystem tree to a
directory on disk, so let's add the other direction: converting an
on-disk directory to an in-memory filesystem tree.  This will let us
scan container images from inside containers.  This is necessary because
we can't get access to the OCI layer tarballs during a container build
(even from a later stage in a multi-stage build) but we can bindmount
the root filesystem.

See containers/buildah#5837

With our recent changes to how we handle metadata on the root directory
we should now be producing the same image on the inside and the outside,
which gives us a nice way to produce a UKI with a built-in `composefs=`
command-line parameter.

Add a new 'unified' example.  This does the container build as a single
`podman build` command with no special arguments.

Closes #34

Signed-off-by: Allison Karlitskaya <[email protected]>