Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uid/gid drift possible when switching between different host builds #673

Open
ggiguash opened this issue Jul 11, 2024 · 13 comments
Open

uid/gid drift possible when switching between different host builds #673

ggiguash opened this issue Jul 11, 2024 · 13 comments
Labels
area/osintegration Relates to an external OS/distro base image

Comments

@ggiguash
Copy link

ggiguash commented Jul 11, 2024

When upgrading from an ostree to a bootc image, the SSH key file permissions may get invalidated in the /etc/ssh directory. This is causing SSH service to fail its startup, effectively preventing the user from remotely accessing the host following the upgrade.

image

(edit by @cgwalters) Originally: https://issues.redhat.com/browse/USHIFT-3749

@ggiguash
Copy link
Author

The following work around can be implemented in a bootc container file to address this issue.

# Implement workarounds necessary for the successful upgrade from an ostree to a bootc image.
# - The permissions of SSH key files at /etc/ssh directory need to be fixed
#   so that they are owned by the 'ssh_keys' group.
RUN mkdir -p /etc/systemd/system/sshd.service.d && \
    printf "\
[Service]\n\
ExecStartPre=/bin/sh -c 'chgrp ssh_keys /etc/ssh/ssh*key*'\n" > "/etc/systemd/system/sshd.service.d/ssh-key-permissions.conf"

@travier
Copy link
Member

travier commented Jul 11, 2024

@cgwalters cgwalters changed the title SSH key permissions are broken when upgrading from an ostree to a bootc image uid/gid drift possible when switching between different host builds Jul 11, 2024
@cgwalters
Copy link
Collaborator

( I've added a link to https://issues.redhat.com/browse/USHIFT-3749 where this was discussed originally to avoid duplicate discussion where relevant)

But again, this issue is not about "ostree vs bootc", it's specially about "ostree commits as built by RHEL for Edge" vs "fedora/c9s/rhel derived bootc" which happen to allocate different uids/gids due to buildsystem drift.

It could also occur when switching between bootc images that have different defined ssh_key gids, and it could also occur when switching between ostree images (again if the buildsystem generating those hasn't ensured they're the same).

Because of this, I don't think this is really a bootc issue at a technical level. It's more a higher level buildsystem issue.

I would accept though that in practice, it may make sense for us to at least track the overall issue here, but...if we were to do something like that ssh key chown hack, I would probably say the code would logically be owned by e.g. https://gitlab.com/fedora/bootc/base-images

Except for the fact that the world does not need more load bearing bash scripts run as unrestricted root, and maybe we actually implement it in Rust here as an opt-in...similarly to coreos/rpm-ostree#4913 in many ways.

@ggiguash
Copy link
Author

ggiguash commented Jul 11, 2024

It could also occur when switching between bootc images that have different defined ssh_key gids, and it could also
occur when switching between ostree images (again if the buildsystem generating those hasn't ensured they're the
same).

The problem can certainly occur when switching between images of the same type, but only in the case when different image parents are used for the base image vs update image. To mitigate this issue, the way we usually build our image hierarchies is to have a base (parent) image and then derive all the updates from it.

The ostree-to-bootc upgrade is a special case because we're guaranteed to have different parents, so it's a 100% failure.
Thus, I thought to emphasize the need to provide a solution for it.

While I prefer this to be handled by the base images we produce, the fallback would be to document all the known pitfalls, which goes back to the #671 issue.

P.S. The title of this issue was changed to be more generic and we may want to go back to the specific problem of ostree-to-bootc upgrade because the generic one is a "known" issue.

@cgwalters cgwalters added the area/osintegration Relates to an external OS/distro base image label Jul 11, 2024
@cgwalters
Copy link
Collaborator

The ostree-to-bootc upgrade is a special case because we're guaranteed to have different parents

No - it's equally possible to build a custom base image using the same buildsystem that produces ostree commits today. In fact, there's significant tooling to streamline exactly that process, such as e.g. rpm-ostree compose container-encapsulate. It's exactly how Fedora CoreOS works today...an ostree commit is generated, and then turned into a container image.

@ggiguash
Copy link
Author

it's equally possible to build a custom base image using the same buildsystem that produces ostree commits today.

I did not know this tooling was in place. I will investigate this direction asap.
Based on this discussion, I agree that we are facing a generic uid/gid mismatch issue, thank you for clarifying it.

@ggiguash
Copy link
Author

ggiguash commented Jul 15, 2024

Further investigation on installing bootc images on top of an unrelated ostree commit led me to a dead end. The problem with the uid/gid is serious. In my case, I got openvswitch and hugetlbfs users/groups overwritten after deploying a bootc image. This makes quite a few services fail on boot and it practically renders the system unusable with no easy way of remediation.

Next, I followed up on using the rpm-ostree compose container-encapsulate command to create a base container image that can be used for bootc container builds and deployment. I completed a full cycle of creating ostree-base-container, derived bootc-container and installing the latter on top of the ostree system. This worked very well - no surprises.

I did encounter a few small issues that might be worth mentioning in the upgrade instructions:

  • The base ostree image should include bootc, dnf and subscription-manager RPM packages
    • bootc is required for switching the image on an ostree system using bootc toolchain
    • dnf is required because many container build procedures use it and it does not come packaged by default with ostree image
    • subscription-manager is required to query/configure RHSM during the container builds or in runtime
  • The following workaround was required in the derived bootc container image build to enable host subscriptions and create missing /opt directory
# Work around the missing ostree-container functionality:
# - Enable librhsm which enables host subscriptions to work in containers
# - Image has /opt symlink pointing to a non-existent /var/opt, so create the latter
RUN ln -sr /run/secrets/etc-pki-entitlement /etc/pki/entitlement-host && \
    ln -sr /run/secrets/rhsm /etc/rhsm-host && \
    mkdir -p /var/opt

@ggiguash
Copy link
Author

@cgwalters , based on my recent experience described in the previous comment, I wonder if we should be stricter in our instructions for ostree-to-bootc upgrade? I mean, should we state that the upgrade is likely to fail unless the bootc container image is derived from the base ostree commit?

@cgwalters
Copy link
Collaborator

The problem with the uid/gid is serious. In my case, I got openvswitch and hugetlbfs users/groups overwritten after deploying a bootc image.

Yeah, it is a really annoying problem. See also openshift/os#1318 where we worked around this just in RHCOS...and that code isn't inherited elsewhere right.

IMO openvswitch especially is a picture perfect use case for systemd DynamicUser=yes too...

cgwalters added a commit to cgwalters/bootc that referenced this issue Nov 5, 2024
cgwalters added a commit to cgwalters/bootc that referenced this issue Nov 6, 2024
@cgwalters
Copy link
Collaborator

Since the ssh_key group is by far the most visible instance of this, it's worth noting that for fedora-bootc its definition is here:
https://gitlab.com/fedora/bootc/base-images/-/blame/main/tier-0/group?ref_type=heads#L26
(Which honestly, is just a fork of FCOS, which is itself a fork of Atomic Host...)

But what travier said is also really important to emphasize - in current Fedora and C10S I think we can just drop the ssh_keys group from the base image.

@cgwalters
Copy link
Collaborator

Let's again narrow in on openvswitch specifically, which has a floating user. Now, it does today have a sysusers.d fragment, but when installed in a container layer, its RPM script ends up creating the user.

However, openvswitch also has a specific problem in that /etc/openvswitch is owned by that uid+gid - so we cannot create it via sysusers because that's "static" content in the image today.

The most direct hack for the openvswitch case today is (at build time) is basically adding a systemd unit that does ExecStartPre=chown -R openvswitch:openvswitch /etc/openvswitch.

But what would be better here...is to not have the user at build time! We should have tooling to "flush" /etc/passwd and /etc/group - basically for each user/group that has a sysusers.d fragment, we undo those changes. This tooling needs to error out if there's any files in the image owned by that uid/gid.

However, even better again than all of that for openvswitch would be DynamicUser=yes - the default config file for openvswitch /etc/openvswitch/default.conf could be owned by root. It looks to me like the rest of the stuff could live in StateDirectory=.

@cgwalters
Copy link
Collaborator

Related to this, cockpit just switched to DynamicUser=yes for similar reasons.

@cgwalters
Copy link
Collaborator

I filed https://issues.redhat.com/browse/RHEL-68655 to engage the openvswitch maintainers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/osintegration Relates to an external OS/distro base image
Projects
None yet
Development

No branches or pull requests

3 participants