
[rawhide][x86_64] : kola tests fail due to systemd pkg upgrade #1857

Closed
aaradhak opened this issue Jan 2, 2025 · 16 comments
Labels: kind/bug, pipeline failure (This issue or pull request is derived from CI failures)

Comments

@aaradhak
Member

aaradhak commented Jan 2, 2025

Describe the bug

Multiple kola tests in the rawhide build seem to fail because of the systemd package upgrade systemd-257-1.fc42 -> systemd-257.1-1.fc42.

A few of the kola test failures are listed below:

[2025-01-02T20:29:19.241Z] --- FAIL: ext.config.toolbox (22.25s)

[2025-01-02T20:29:19.241Z]         harness.go:1823: mach.Start() failed: machine "d8d8cd2b-e47e-428c-92d9-920a3df98f67" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

[2025-01-02T20:29:37.277Z] --- FAIL: ext.config.var-mount.scsi-id (33.90s)

[2025-01-02T20:29:37.277Z]         harness.go:1823: mach.Start() failed: machine "91bd53f7-d9e9-46bd-9a25-fe7044689fbb" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

[2025-01-02T20:29:37.277Z] --- FAIL: coreos.ignition.mount.disks (36.07s)

[2025-01-02T20:29:37.277Z]         mount.go:129: machine "2a724ae9-ffc7-4ad7-be71-9a7c45a9540b" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

[2025-01-02T20:29:38.202Z] --- FAIL: ext.config.selinux.enforcing (21.94s)

[2025-01-02T20:29:38.202Z]         harness.go:1823: mach.Start() failed: machine "0a2a33aa-7940-4f2e-a67e-4d07d3f433f8" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

[2025-01-02T20:29:41.460Z] --- FAIL: podman.workflow (22.19s)

[2025-01-02T20:29:41.460Z]         harness.go:1823: mach.Start() failed: machine "bdc00777-db18-46ed-b387-cd3b859c7d54" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

The console log carried messages about the failure to start Network Name Resolution:

         Starting systemd-resolved.service - Network Name Resolution...
[  OK  ] Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
[  OK  ] Reached target local-fs-pre.target - Preparation for Local File Systems.
[FAILED] Failed to start systemd-resolved.service - Network Name Resolution.
See 'systemctl status systemd-resolved.service' for details.
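
To dig further on an affected machine, the unit status and its journal are the standard places to look (generic systemctl/journalctl invocations, not taken from this report):

$ systemctl status systemd-resolved.service
$ journalctl -u systemd-resolved.service -b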

Reproduction steps

  1. Build the latest rawhide-x86_64 image:
     cosa fetch && cosa build
  2. Run one of the failing tests:
     kola run ext.config.toolbox

Expected behavior

The kola tests (ext.config.toolbox and the others listed above) pass.

Actual behavior

Multiple kola tests fail with the error:

[2025-01-02T20:29:19.241Z] harness.go:1823: mach.Start() failed: machine "d8d8cd2b-e47e-428c-92d9-920a3df98f67" failed basic checks: detected failed or stuck systemd units: some systemd units failed: systemd-resolved.service; <nil>

System details

rawhide - x86_64

Butane or Ignition config

No response

Additional information

console.txt
journal.txt

aaradhak added the kind/bug and pipeline failure labels on Jan 2, 2025
@aaradhak
Member Author

aaradhak commented Jan 2, 2025

Pinned systemd to systemd-257-1.fc42
Override PR - coreos/fedora-coreos-config#3310
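
For reference, FCOS pins a package with a lockfile override in fedora-coreos-config. A minimal sketch of such an entry, assuming the usual manifest-lock.overrides.yaml layout (illustrative only; see the PR above for the real change):

# manifest-lock.overrides.yaml (sketch, not the literal PR contents)
packages:
  systemd:
    evr: 257-1.fc42
    metadata:
      type: pin
      reason: <link to this issue>  # placeholder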

dustymabe changed the title from "[rawhide][x86_64] : kola tests fail due to systemd pkg upgarde" to "[rawhide][x86_64] : kola tests fail due to systemd pkg upgrade" on Jan 3, 2025
@keszybz

keszybz commented Jan 8, 2025

Jan  2 20:14:44.237412 init.scope[1]: Starting systemd-resolved.service - Network Name Resolution...
Jan  2 20:14:44.241388 init.scope[1]: Starting systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev...
Jan  2 20:14:44.326712 systemd-resolved.service[1499]: Failed to create destination mount point node '/run/systemd/mount-rootfs/var/tmp', ignoring: Read-only file system
Jan  2 20:14:44.329572 systemd-resolved.service[1499]: Failed to mount /run/systemd/unit-private-tmp/var-tmp to /run/systemd/mount-rootfs/var/tmp: No such file or directory
Jan  2 20:14:44.329606 systemd-resolved.service[1499]: systemd-resolved.service: Failed to set up mount namespacing: /var/tmp: No such file or directory
Jan  2 20:14:44.329637 systemd-resolved.service[1499]: systemd-resolved.service: Failed at step NAMESPACE spawning /usr/lib/systemd/systemd-resolved: No such file or directory
Jan  2 20:14:44.336540 init.scope[1]: systemd-resolved.service: Main process exited, code=exited, status=226/NAMESPACE
Jan  2 20:14:44.336746 init.scope[1]: systemd-resolved.service: Failed with result 'exit-code'.
Jan  2 20:14:44.337678 init.scope[1]: Failed to start systemd-resolved.service - Network Name Resolution.

This "init.scope" is very strange. It would seem that journald was not able to figure out how PID1 is called. This may happen for other processes if they die very quickly, but of course this doesn't apply to PID1. I have never seen this before.

It looks like EROFS is the original error.

There is nothing in v257.2 that looks related. There was a commit in v257.1 (systemd/systemd@1f6e192848). Are you sure that it's a regression between .1 and .2?

@keszybz

keszybz commented Jan 8, 2025

I was made aware that I misread the initial report and it's a regression from 257. So indeed systemd/systemd@1f6e192848 is probably the culprit. The question is what is special about this environment that triggers the issue.

@cgwalters
Member

What's special is likely that we create /var/tmp via systemd-tmpfiles.
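
For context, /var/tmp is created by a stock tmpfiles.d entry; the upstream systemd tmp.conf entries look roughly like this (approximate, shown for illustration):

# tmpfiles.d/tmp.conf (upstream systemd, approximate)
# 'q' acts like 'd' (create the directory if missing, with the given
# mode/owner and cleanup age), optionally as a btrfs subvolume.
q /tmp     1777 root root 10d
q /var/tmp 1777 root root 30d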

@keszybz

keszybz commented Jan 8, 2025

Hmm, if /var/tmp doesn't exist, systemd should gracefully not create a private /var/tmp. I think there was some code for this, but it's been a long time since this came up. I'll check if it's there and if it handles EROFS.
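
The code path in question is the PrivateTmp= sandboxing: a unit with it enabled gets private instances of /tmp and /var/tmp mounted into its namespace before the main process starts, which is why a missing /var/tmp shows up as status=226/NAMESPACE. A hypothetical minimal unit exercising the same path:

# /etc/systemd/system/privatetmp-repro.service (hypothetical reproducer)
[Unit]
Description=Exercise PrivateTmp mount namespace setup

[Service]
Type=oneshot
# PID 1 bind-mounts private /tmp and /var/tmp into the unit's mount
# namespace; if /var/tmp does not exist, setup fails as in the logs above.
PrivateTmp=yes
ExecStart=/bin/true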

@cgwalters
Member

Or at least, we can in some scenarios.

Though note for Anaconda installs, we run systemd-tmpfiles for a select set of paths before rebooting (because anaconda has %post which often expects these things to exist...). And Anaconda is how Silverblue etc. get installed.

Hmm, though one interesting thing here is that in the FCOS container we do have /var/tmp:

$ podman run --rm -ti quay.io/fedora/fedora-coreos:stable ls -al /var
total 0
drwxr-xr-x. 3 root root 17 Jan  1  1970 .
dr-xr-xr-x. 1 root root 28 Jan  8 17:37 ..
drwxrwxrwt. 2 root root  6 Jan  1  1970 tmp
$

The qcow2 disk image somehow doesn't:

guestfish --ro -a fedora-coreos-41.20250105.1.1-qemu.x86_64.qcow2

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: ‘help’ for help on commands
      ‘man’ to read the manual
      ‘quit’ to quit the shell

><fs> run
list-filesystems><fs> list-filesystems
/dev/sda1: unknown
/dev/sda2: vfat
/dev/sda3: ext4
/dev/sda4: xfs
><fs> mount /dev/sda4 /
><fs> ls /ostree/deploy/fedora-coreos/var/
.ostree-selabeled
><fs> 

Which is quite surprising to me, because ostree, as of recently, does automatically copy data from /var on initial deployments. Maybe it is something related to the osbuild disk pipeline?

@cgwalters
Member

> I'll check if it's there and if it handles EROFS.

In the scenario under test here, the filesystem is writable - /var is a bind mount into the deployment stateroot.

So what would need to be handled is ENOENT - which may be what you meant.
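
For anyone verifying the mount layout on a booted FCOS machine, standard tooling shows the bind mount (generic commands; output omitted):

$ findmnt /var
$ ls /ostree/deploy/fedora-coreos/var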

@AdamWill

AdamWill commented Jan 9, 2025

I think this is probably https://bugzilla.redhat.com/show_bug.cgi?id=2334015 ?

edit: well, hmm, it's a bit different as you're getting "Read-only file system" not "Permission denied"...probably caused by the same upstream change, though.

bluca pushed a commit to bluca/systemd-fedora that referenced this issue Jan 10, 2025
@keszybz

keszybz commented Jan 11, 2025

systemd-257.2-4 has the offending patch reverted.

@aaradhak
Member Author

I checked the kola tests locally with systemd-257.2-4.fc42 and they seem to PASS.

@aaradhak
Member Author

Opened a fast-track PR for this - coreos/fedora-coreos-config#3316

@dustymabe
Member

Should be taken care of with coreos/fedora-coreos-config#3316

@cgwalters
Member

This is the cause of https://bugzilla.redhat.com/show_bug.cgi?id=2339009

@cgwalters
Member

I will say here that the lockfile stuff makes total sense in isolation, but the only long-term sustainable solution is to support reverting package builds, always. For exactly things like this, systemd should have been reverted everywhere, not just pinned in FCOS.

Not just because we have other deliverables (e.g. bootc) that aren't pinning (semi-intentionally) today, but because pinning in one place creates confusing technical debt that FCOS maintainers need to remember to clean up - which is exactly what happened in this case.

So again, we should be aiming towards requiring PRs for packages, and having reasonable gating there.

@dustymabe
Member

We had a PR to drop the override that apparently slipped through the cracks: coreos/fedora-coreos-config#3318.

I've merged that now.

@AdamWill

openQA would usually have gated something like this, but weirdly the bug was specific to certain environments (Cloud and CoreOS; not sure what the common attribute is?), and we don't run any tests of those on updates, so the tests all passed :/ We do test service startup on Server, KDE, and Workstation environments, so if they'd been affected the update would've been gated. In the event, we only caught it when it landed in a compose and the cloud tests ran.
