
[rawhide][x86_64] : kola tests fail due to systemd pkg upgrade #1857

Closed
aaradhak opened this issue Jan 2, 2025 · 16 comments
Labels: kind/bug, pipeline failure (This issue or pull request is derived from CI failures)

Comments

@aaradhak
Member

aaradhak commented Jan 2, 2025

Describe the bug

Multiple kola tests in the rawhide build seem to fail because of the systemd package upgrade systemd-257-1.fc42 -> systemd-257.1-1.fc42.

A few of the kola test failures are listed below:

[2025-01-02T20:29:19.241Z] --- FAIL: ext.config.toolbox (22.25s)

[2025-01-02T20:29:19.241Z]         harness.go:1823: mach.Start() failed: machine "d8d8cd2b-e47e-428c-92d9-920a3df98f67" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

[2025-01-02T20:29:37.277Z] --- FAIL: ext.config.var-mount.scsi-id (33.90s)

[2025-01-02T20:29:37.277Z]         harness.go:1823: mach.Start() failed: machine "91bd53f7-d9e9-46bd-9a25-fe7044689fbb" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

[2025-01-02T20:29:37.277Z] --- FAIL: coreos.ignition.mount.disks (36.07s)

[2025-01-02T20:29:37.277Z]         mount.go:129: machine "2a724ae9-ffc7-4ad7-be71-9a7c45a9540b" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

[2025-01-02T20:29:38.202Z] --- FAIL: ext.config.selinux.enforcing (21.94s)

[2025-01-02T20:29:38.202Z]         harness.go:1823: mach.Start() failed: machine "0a2a33aa-7940-4f2e-a67e-4d07d3f433f8" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

[2025-01-02T20:29:41.460Z] --- FAIL: podman.workflow (22.19s)

[2025-01-02T20:29:41.460Z]         harness.go:1823: mach.Start() failed: machine "bdc00777-db18-46ed-b387-cd3b859c7d54" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

The console log carried messages about the failure to start Network Name Resolution:

         Starting systemd-resolved.service - Network Name Resolution...
[  OK  ] Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
[  OK  ] Reached target local-fs-pre.target - Preparation for Local File Systems.
[FAILED] Failed to start systemd-resolved.service - Network Name Resolution.
See 'systemctl status systemd-resolved.service' for details.
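
To dig further on an affected machine, the unit status and its journal are the standard places to look (generic systemctl/journalctl invocations, not taken from this report):

$ systemctl status systemd-resolved.service
$ journalctl -u systemd-resolved.service -b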

Reproduction steps

  1. Build the latest rawhide-x86_64 image:
     cosa fetch && cosa build
  2. Run one of the failing tests:
     kola run ext.config.toolbox

Expected behavior

The kola tests (ext.config.toolbox and the others listed above) pass.

Actual behavior

Multiple kola tests fail with the error:

[2025-01-02T20:29:19.241Z] harness.go:1823: mach.Start() failed: machine "d8d8cd2b-e47e-428c-92d9-920a3df98f67" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

System details

rawhide - x86_64

Butane or Ignition config

No response

Additional information

console.txt
journal.txt

aaradhak added the kind/bug and pipeline failure labels on Jan 2, 2025
@aaradhak
Member Author

aaradhak commented Jan 2, 2025

Pinned systemd to systemd-257-1.fc42
Override PR - coreos/fedora-coreos-config#3310
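
For reference, FCOS pins a package with a lockfile override in fedora-coreos-config. A minimal sketch of such an entry, assuming the usual manifest-lock.overrides.yaml layout (illustrative only; see the PR above for the real change):

# manifest-lock.overrides.yaml (sketch, not the literal PR contents)
packages:
  systemd:
    evr: 257-1.fc42
    metadata:
      type: pin
      reason: <link to this issue>  # placeholder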

dustymabe changed the title from "[rawhide][x86_64] : kola tests fail due to systemd pkg upgarde" to "[rawhide][x86_64] : kola tests fail due to systemd pkg upgrade" on Jan 3, 2025
@keszybz

keszybz commented Jan 8, 2025

Jan  2 20:14:44.237412 init.scope[1]: Starting systemd-resolved.service - Network Name Resolution...
Jan  2 20:14:44.241388 init.scope[1]: Starting systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev...
Jan  2 20:14:44.326712 systemd-resolved.service[1499]: Failed to create destination mount point node '/run/systemd/mount-rootfs/var/tmp', ignoring: Read-only file system
Jan  2 20:14:44.329572 systemd-resolved.service[1499]: Failed to mount /run/systemd/unit-private-tmp/var-tmp to /run/systemd/mount-rootfs/var/tmp: No such file or directory
Jan  2 20:14:44.329606 systemd-resolved.service[1499]: systemd-resolved.service: Failed to set up mount namespacing: /var/tmp: No such file or directory
Jan  2 20:14:44.329637 systemd-resolved.service[1499]: systemd-resolved.service: Failed at step NAMESPACE spawning /usr/lib/systemd/systemd-resolved: No such file or directory
Jan  2 20:14:44.336540 init.scope[1]: systemd-resolved.service: Main process exited, code=exited, status=226/NAMESPACE
Jan  2 20:14:44.336746 init.scope[1]: systemd-resolved.service: Failed with result 'exit-code'.
Jan  2 20:14:44.337678 init.scope[1]: Failed to start systemd-resolved.service - Network Name Resolution.

This "init.scope" is very strange. It would seem that journald was not able to figure out how PID1 is called. This may happen for other processes if they die very quickly, but of course this doesn't apply to PID1. I have never seen this before.

It looks like EROFS is the original error.

There is nothing in v257.2 that looks related. There was a commit in v257.1 (systemd/systemd@1f6e192848). Are you sure that it's a regression between .1 and .2?

@keszybz

keszybz commented Jan 8, 2025

I was made aware that I misread the initial report and it's a regression from 257. So indeed systemd/systemd@1f6e192848 is probably the culprit. The question is what is special about this environment that triggers the issue.

@cgwalters
Member

What's special is likely that we create /var/tmp via systemd-tmpfiles.
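
For context, /var/tmp is created by a stock tmpfiles.d entry; the upstream systemd tmp.conf entries look roughly like this (approximate, shown for illustration):

# tmpfiles.d/tmp.conf (upstream systemd, approximate)
# 'q' acts like 'd' (create the directory if missing, with the given
# mode/owner and cleanup age), optionally as a btrfs subvolume.
q /tmp     1777 root root 10d
q /var/tmp 1777 root root 30d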

@keszybz

keszybz commented Jan 8, 2025

Hmm, if /var/tmp doesn't exist, systemd should gracefully not create a private /var/tmp. I think there was some code for this, but it's been a long time since this came up. I'll check if it's there and if it handles EROFS.
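
The code path in question is the PrivateTmp= sandboxing: a unit with it enabled gets private instances of /tmp and /var/tmp mounted into its namespace before the main process starts, which is why a missing /var/tmp shows up as status=226/NAMESPACE. A hypothetical minimal unit exercising the same path:

# /etc/systemd/system/privatetmp-repro.service (hypothetical reproducer)
[Unit]
Description=Exercise PrivateTmp mount namespace setup

[Service]
Type=oneshot
# PID 1 bind-mounts private /tmp and /var/tmp into the unit's mount
# namespace; if /var/tmp does not exist, setup fails as in the logs above.
PrivateTmp=yes
ExecStart=/bin/true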

@cgwalters
Member

Or at least, we can in some scenarios.

Though note for Anaconda installs, we run systemd-tmpfiles for a select set of paths before rebooting (because anaconda has %post which often expects these things to exist...). And Anaconda is how Silverblue etc. get installed.

Hmm, though one interesting thing here is that in the FCOS container we do have /var/tmp:

$ podman run --rm -ti quay.io/fedora/fedora-coreos:stable ls -al /var
total 0
drwxr-xr-x. 3 root root 17 Jan  1  1970 .
dr-xr-xr-x. 1 root root 28 Jan  8 17:37 ..
drwxrwxrwt. 2 root root  6 Jan  1  1970 tmp
$

The qcow2 disk image somehow doesn't:

guestfish --ro -a fedora-coreos-41.20250105.1.1-qemu.x86_64.qcow2

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: ‘help’ for help on commands
      ‘man’ to read the manual
      ‘quit’ to quit the shell

><fs> run
list-filesystems><fs> list-filesystems
/dev/sda1: unknown
/dev/sda2: vfat
/dev/sda3: ext4
/dev/sda4: xfs
><fs> mount /dev/sda4 /
><fs> ls /ostree/deploy/fedora-coreos/var/
.ostree-selabeled
><fs> 

Which is quite surprising to me, because ostree, as of recently, does automatically copy data from /var on initial deployments. Maybe it is something related to the osbuild disk pipeline?

@cgwalters
Member

> I'll check if it's there and if it handles EROFS.

In the scenario under test here, the filesystem is writable - /var is a bind mount into the deployment stateroot.

So what would need to be handled is ENOENT - which may be what you meant.
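
For anyone verifying the mount layout on a booted FCOS machine, standard tooling shows the bind mount (generic commands; output omitted):

$ findmnt /var
$ ls /ostree/deploy/fedora-coreos/var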

@AdamWill

AdamWill commented Jan 9, 2025

I think this is probably https://bugzilla.redhat.com/show_bug.cgi?id=2334015 ?

edit: well, hmm, it's a bit different as you're getting "Read-only file system" not "Permission denied"...probably caused by the same upstream change, though.

bluca pushed a commit to bluca/systemd-fedora that referenced this issue Jan 10, 2025
@keszybz

keszybz commented Jan 11, 2025

systemd-257.2-4 has the offending patch reverted.

@aaradhak
Member Author

I checked the kola tests locally with systemd-257.2-4.fc42 and they seem to PASS.

@aaradhak
Member Author

Opened a fast-track PR for this - coreos/fedora-coreos-config#3316

@dustymabe
Member

Should be taken care of with coreos/fedora-coreos-config#3316

@cgwalters
Member

This is the cause of https://bugzilla.redhat.com/show_bug.cgi?id=2339009

@cgwalters
Member

I will say here that the lockfile stuff makes total sense in isolation, but the only long-term sustainable solution is to support reverting package builds, always. For exactly things like this, systemd should have been reverted everywhere, not just pinned in FCOS.

Not just because we have other deliverables (e.g. bootc) that aren't pinning (semi-intentionally) today, but because pinning in one place creates confusing technical debt that FCOS maintainers need to remember to clean up - which is exactly what happened in this case.

So again, we should be aiming towards requiring PRs for packages, and having reasonable gating there.

@dustymabe
Member

We had a PR to drop the override that apparently slipped through the cracks: coreos/fedora-coreos-config#3318.

I've merged that now.

@AdamWill

openQA would usually have gated something like this, but weirdly the bug was specific to certain environments (Cloud and CoreOS; not sure what the common attribute is?), and we don't run any tests of those on updates, so the tests all passed :/ We do test service startup on Server, KDE, and Workstation environments, so if they'd been affected the update would've been gated. In the event, we only caught it when it landed in a compose and the cloud tests ran.
