Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make our Dangerzone image reproducible #1049

Open
wants to merge 27 commits into
base: main
Choose a base branch
from
Open

Conversation

apyrgio
Copy link
Contributor

@apyrgio apyrgio commented Jan 14, 2025

This PR makes our container image reproducible, and enforces it with a CI job on every commit. To do so, we also change our base image from Alpine Linux to Debian Stable, and we use some prior art and tooling from the work that @AkihiroSuda has done on reproducible containers.

Fixes #1046
Fixes #1047
Fixes #1048

Move container-only build context - currently just the entrypoint script
- from `dangerzone/gvisor_wrapper` to `dangerzone/container`. Update the
  rest of the scripts to use this location as well.
Remove the need to copy the Dangerzone container image (used by the
inner container) within a wrapper gVisor image (used by the outer
container). Instead, use the root of the container filesystem for both
containers. We can do this safely because we don't mount any secrets to
the container, and because gVisor offers a read-only view of the
underlying filesystem

Fixes #1048
Download and copy the following artifacts that will be used for building
a Debian-based Dangerzone container image in the subsequent commits:
* The APT key for the gVisor repo [1]
* A helper script for building reproducible Debian images [2]

[1] https://gvisor.dev/archive.key
[2] https://github.com/reproducible-containers/repro-sources-list.sh/blob/d15cf12b26395b857b24fba223b108aff1c91b26/repro-sources-list.sh
Switch base image from Alpine Linux to Debian Stable, in order to reduce
our image footprint, improve our security posture, and build our
container image reproducibly.

Fixes #1046
Refs #1047
Remove all the scaffolding in our `build-image.py` script for using the
`poetry.lock` file, now that we install PyMuPDF from the Debian repos.
Remove our suggestions for not using the container cache, which stemmed
from the fact that our Dangerzone image was not reproducible. Now that
we have switched to Debian Stable and the Dockerfile is all we need to
reproducibly build the exact same container image, we can just use the
cache to speed up builds.
Add jinja2-cli as a package dependency, since it will be used to create
the Dockerfile from some user parameters and a template.
Allow updating the Dockerfile from a template and some envs, so that
it's easier to bump the dates in it.
Allow setting a tag for the container image, when building it with the
`build-image.py` script. This should be used for development purposes
only, since the proper image name should be dictated by the script.
Add a dev script for Linux platforms that verifies that a source image
can be reproducibly built from the current Git commit. The
reproducibility check is enforced by the `diffoci` tool, which is
downloaded as part of running the script.
Add a CI job that uses the `reproduce.py` dev script to enforce image
reproducibility, for every PR that we send to the repo.

Fixes #1047
@apyrgio apyrgio force-pushed the 1046-reproducibility branch from 5cea5e1 to 279322b Compare January 14, 2025 09:58
@apyrgio apyrgio force-pushed the 1046-reproducibility branch from c1f5d75 to e02dbfd Compare January 14, 2025 10:29
Copy link
Contributor

@almet almet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this!

I added a few comments here and there, and will probably come back to it once I've run it on my laptop. It's pretty neat!

.github/workflows/build.yml Outdated Show resolved Hide resolved
.github/workflows/ci.yml Outdated Show resolved Hide resolved

FROM debian:bookworm-${DEBIAN_IMAGE_DATE}-slim

ARG GVISOR_ARCHIVE_DATE=20250106
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding links here could allow us to follow them when we want to update the versions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, should we get these directly from the Dockerfile.env file instead of defining them twice?

Copy link
Contributor Author

@apyrgio apyrgio Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. So, I've created an issue for automating the bump of these values, since I wanted to pursue it in this PR, but I didn't have time (check #1050). As for the URLs, I've added them to Dockerfile.env as comments (see 96ab442, which also fixes a bug). As for the values in Dockerfile, they are actually generated from the following command:

poetry run jinja2 Dockerfile.in Dockerfile.env > Dockerfile

So, they are practically defined once. There's also a CI job that makes sure that the Dockerfile does not deviate from the template and committed params. Are you ok with that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't realize Dockerfile was an artifact now then.

Should we then just check-in Dockerfile.in in our Git tree and remove Dockerfile from it (and add it to .gitignore)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesn't hurt committing it, and it's easier for viewers of the repo to see at a glance what are the contents of the image.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the "cons" side, it means that all our changes to the containers will be duplicated in these two files each time we do a change. I would prefer otherwise but can live with it if you have strong feeling on this.

If we decide to go down that road, we might want to have a CI task ensuring that Docker is then the output of Docker.in processed so the two don't end up out of sync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to go down that road, we might want to have a CI task ensuring that Docker is then the output of Docker.in processed so the two don't end up out of sync.

Yeap, I agree. As I mentioned in #1049 (comment), there's one already in place:

- name: Verify that the Dockerfile matches the commited template and params
run: |-
poetry run jinja2 Dockerfile.in Dockerfile.env > out
diff Dockerfile out

Dockerfile Outdated

# Create an unprivileged user both for gVisor and for running Dangerzone.
RUN mkdir -p /opt/dangerzone/dangerzone && \
touch /opt/dangerzone/dangerzone/__init__.py && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to create this __init__.py file here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, fixed in cbeb103

conversion/common.py \
conversion/errors.py \
conversion/__init__.py \
/opt/dangerzone/dangerzone/conversion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you're copying files from the dangerzone.conversion python package. Should we copy all of the ones from this folder instead? Also, I'm starting to wonder if it could make sense to have a separate package for this, if we start to copy files manually. Just a thought, but I'm curious about what you'd think of it 💭

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm copying the .py files specifically for two reasons: a) to avoid picking up the .pyc files, and b) to make the list of copied files more explicit for reproducibility reasons. As for having a separate package, I think it will add a bit more complexity than strictly necessary for this invocation. I'd like to know if you have something particular in mind though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a bit brittle to copy these manually, as it can break when we start adding dependencies to these files for instance. Manually choosing what to copy adds some logic here that could be removed by grouping the items we want to copy together and copying the folder.

I get what you mean about the .pyc files, I suppose it's possible to remove them after copying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think a middle ground is to copy conversion/*.py, which would future-proof us against that scenario you mention. What do you think?

docs/developer/reproducibility.md Outdated Show resolved Hide resolved
docs/developer/reproducibility.md Outdated Show resolved Hide resolved
--ignore-timestamps --ignore-image-name --verbose
```

### Updating the image
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we're missing a howto check reproducibility for old images. Maybe rephrasing a bit this section could help in this regard (as I believe it would be updating the version numbers / dates and rebuilding the image, then diffing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already have a "Reproducing the image" section in the design doc, not sure if you've seen it. If it's not clear what the user should do, here's an update in the wording: aa710e8

# 3. If the contents of the Git repo are dirty, we will append a unique identifier
# for this run, something like `0.8.0-31-g6bdaa7a-fdcb` or `0.8.0-fdcb`.
dirty_ident = secrets.token_hex(2)
tag = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could go in its own function to ease readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in 92d8a4c

@@ -28,7 +28,7 @@ def main():
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this, I wonder if we should rename the vendor-pymupdf to something specifically targeting debian-based distributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
2 participants