Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pxe-offline test fails intermittently #1339

Closed
aaradhak opened this issue Nov 7, 2022 · 14 comments
Closed

pxe-offline test fails intermittently #1339

aaradhak opened this issue Nov 7, 2022 · 14 comments
Labels
jira for syncing to jira kind/bug

Comments

@aaradhak
Copy link
Member

aaradhak commented Nov 7, 2022

Describe the bug

The pxe-offline test fails intermittently in the testing-devel [x86_64] branch
Build failure - 472

18:08:49  FAIL: pxe-offline-install (bios + metal) (1m10.277s)
18:08:49      Got EOF from completion channel, coreos-installer-test-OK expected
18:08:49  Error: Got EOF from completion channel, coreos-installer-test-OK expected
18:08:49  2022-10-28T23:08:42Z cli: Got EOF from completion channel, coreos-installer-test-OK expected
18:08:49  error: failed to execute cmd-kola: exit status 1

Reproduction steps

Not reproducible. Occurs intermittently

Expected behavior

The coreos-installer-service is expected to write the ignition config and first-boot kernel arguments for the installation to complete and the pxe-offline test to PASS.

Actual behavior

In pxe-offline test failure, it was observed that the coreos-installer-service failed to write the ignition config. Also, on further looking at the journal log it was found that the systemd-hostnamed service got deactivated after the coreos-installer-service completed the Read-disk task.

System details

[testing-devel][x86_64] ⚡ 36.20221028.20.0

Ignition config

No response

Additional information

The coreos-installer-service stops after the Read-disk task is complete.
console.txt :

Starting �[0;1;39mcoreos-installer.service�[0m - CoreOS Installer...
[   16.555420] coreos-installer-service[1125]: coreos-installer install /dev/vda --ignition-url http://192.168.76.2:43003/config.ign --insecure-ignition --firstboot-args rd.neednet=1 ip=dhcp  --insecure --fetch-retries infinite
[   17.560258] coreos-installer-service[1142]: Installing Fedora CoreOS 36.20221028.20.0 x86_64 (512-byte sectors)
[   18.574373] coreos-installer-service[1142]: Read disk 128.0 MiB/2.4 GiB (5%)
[   19.574027] coreos-installer-service[1142]: Read disk 287.8 MiB/2.4 GiB (11%)
[   20.574346] coreos-installer-service[1142]: Read disk 445.9 MiB/2.4 GiB (18%)
[   21.574669] coreos-installer-service[1142]: Read disk 609.8 MiB/2.4 GiB (25%)
[   22.574665] coreos-installer-service[1142]: Read disk 725.1 MiB/2.4 GiB (29%)
[   23.574719] coreos-installer-service[1142]: Read disk 865.5 MiB/2.4 GiB (35%)
[   24.574730] coreos-installer-service[1142]: Read disk 1023.2 MiB/2.4 GiB (41%)
[   25.574915] coreos-installer-service[1142]: Read disk 1.1 GiB/2.4 GiB (47%)
[   26.575055] coreos-installer-service[1142]: Read disk 1.3 GiB/2.4 GiB (53%)
[   28.589050] coreos-installer-service[1142]: Read disk 1.3 GiB/2.4 GiB (54%)
[   29.588710] coreos-installer-service[1142]: Read disk 1.4 GiB/2.4 GiB (58%)
[   30.588974] coreos-installer-service[1142]: Read disk 1.5 GiB/2.4 GiB (62%)
[   31.588739] coreos-installer-service[1142]: Read disk 1.6 GiB/2.4 GiB (65%)
[   32.589097] coreos-installer-service[1142]: Read disk 1.6 GiB/2.4 GiB (67%)
[   33.682057] coreos-installer-service[1142]: Read disk 1.6 GiB/2.4 GiB (67%)
[   34.685989] coreos-installer-service[1142]: Read disk 1.7 GiB/2.4 GiB (69%)
[   35.686038] coreos-installer-service[1142]: Read disk 1.7 GiB/2.4 GiB (70%)
[   36.691591] coreos-installer-service[1142]: Read disk 1.7 GiB/2.4 GiB (71%)
[   37.927148] coreos-installer-service[1142]: Read disk 1.8 GiB/2.4 GiB (74%)
[   38.970807] coreos-installer-service[1142]: Read disk 1.8 GiB/2.4 GiB (76%)
[   39.970716] coreos-installer-service[1142]: Read disk 1.8 GiB/2.4 GiB (77%)
[   40.971144] coreos-installer-service[1142]: Read disk 1.9 GiB/2.4 GiB (78%)
[   41.971450] coreos-installer-service[1142]: Read disk 1.9 GiB/2.4 GiB (79%)
[   42.972411] coreos-installer-service[1142]: Read disk 1.9 GiB/2.4 GiB (81%)
[   43.973078] coreos-installer-service[1142]: Read disk 2.0 GiB/2.4 GiB (85%)
[   44.973632] coreos-installer-service[1142]: Read disk 2.1 GiB/2.4 GiB (89%)
[   45.972755] coreos-installer-service[1142]: Read disk 2.2 GiB/2.4 GiB (93%)
[   46.973071] coreos-installer-service[1142]: Read disk 2.3 GiB/2.4 GiB (96%)
[   47.674720] coreos-installer-service[1142]: Read disk 2.4 GiB/2.4 GiB (100%)
[   47.678823] coreos-installer-service[1142]: Read disk 2.4 GiB/2.4 GiB (100%)

journal.txt:

Read disk 2.4 GiB/2.4 GiB (100%)
Read disk 2.4 GiB/2.4 GiB (100%)
SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
systemd-hostnamed.service: Deactivated successfully.
audit: type=1131 audit(1666998518.206:239): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
@aaradhak aaradhak added kind/bug jira for syncing to jira labels Nov 7, 2022
@dustymabe
Copy link
Member

so it just appears to hang?

@aaradhak
Copy link
Member Author

aaradhak commented Nov 7, 2022

yes, it hangs after the read-disk task is complete.

@marmijo
Copy link
Member

marmijo commented Nov 8, 2022

I'm not very familiar with Rust or coreos-installer. It seems like coreos-installer install is failing, without error, before getting to this line: src/install.rs #L222. I'm still trying to look into what could be going on.

@dustymabe
Copy link
Member

Saw this in bump-lockfile#250 today.

@aaradhak
Copy link
Member Author

aaradhak commented Nov 8, 2022

Observed the same failure in bump-lockfile last week too build500.

This pxe-install test failure seems to be slightly different from the pxe-offline test failure.

Looking at the console log, it seems to hang after finishing coreos-test-insta…TestISO Verify No Ignition Config

[�[0;32m  OK  �[0m] Started �[0;1;39msystemd-homed.service�[0m - Home Area Manager.
[�[0;32m  OK  �[0m] Finished �[0;1;39msystemd-homed-act…ervice�[0m - Home Area Activation.
[�[0;32m  OK  �[0m] Finished �[0;1;39mcoreos-ignition-f…oreOS Mark Ignition Boot Complete.
         Starting �[0;1;39mcoreos-testinsta…stISO Verify No Ignition Config...
[�[0;32m  OK  �[0m] Finished �[0;1;39mcoreos-test-insta…TestISO Verify No Ignition Config.
         Starting �[0;1;39mcoreos-test-insta…[0m - TestISO Signal Completion...

@dustymabe
Copy link
Member

Seen in bump-lockfile#257

@jlebon
Copy link
Member

jlebon commented Nov 10, 2022

I'm like 95% sure this is actually the QEMU process getting OOMkilled. Some clues are:

  • the abrupt end of console and journal logs
  • EOF from the virtio-serial port makes sense if QEMU suddenly dies
  • this is only happening on x86_64, which is the only arch where this test runs under cgroups (other arches run unconstrained on builders)

I started a patch to enhance cosa to better report when the test fails because QEMU died, but ran out of cycles for today. I'd say for now let's just bump the memory requests in build and bump-lockfile. But we should dig into why our memory calculations are off.

jlebon added a commit to jlebon/coreos-assembler that referenced this issue Nov 11, 2022
Right now when QEMU gets killed during a testiso run, the error is:

    Got EOF from completion channel, coreos-installer-test-OK expected

This is accurate but doesn't hint well enough at the underlying cause.
Rework the two spots in which we wait for virtio-serial strings to also
check if QEMU was killed to provide a better error. E.g.:

    FAIL: pxe-install (bios + metal) (34.483s)
        QEMU unexpectedly exited while waiting awaiting completion: process killed

Related: coreos/fedora-coreos-tracker#1339
jlebon added a commit to jlebon/coreos-assembler that referenced this issue Nov 11, 2022
Right now when QEMU gets killed during a testiso run, the error is:

    FAIL: pxe-offline-install (bios + metal) (1m10.277s)
        Got EOF from completion channel, coreos-installer-test-OK expected

This is accurate but doesn't hint well enough at the underlying cause.
Rework the two spots in which we wait for virtio-serial strings to also
check if QEMU was killed to provide a better error. E.g.:

    FAIL: pxe-install (bios + metal) (34.483s)
        QEMU unexpectedly exited while waiting awaiting completion: process killed

Related: coreos/fedora-coreos-tracker#1339
@jlebon
Copy link
Member

jlebon commented Nov 11, 2022

Opened coreos/coreos-assembler#3192 which improves the error message in that case. If we merge that, we should be able to confirm that's what's happening before bumping memory requests.

@marmijo
Copy link
Member

marmijo commented Nov 11, 2022

Great find! It'll be nice to see some error messages come through. The lack of them has made this difficult to debug.

I'm like 95% sure this is actually the QEMU process getting OOMkilled. Some clues are:

  • the abrupt end of console and journal logs

That would make sense. The console and journal logs are stopping at the same point each time meaning it could be exceeding the memory limit consistently at the same place.

dustymabe pushed a commit to coreos/coreos-assembler that referenced this issue Nov 11, 2022
Right now when QEMU gets killed during a testiso run, the error is:

    FAIL: pxe-offline-install (bios + metal) (1m10.277s)
        Got EOF from completion channel, coreos-installer-test-OK expected

This is accurate but doesn't hint well enough at the underlying cause.
Rework the two spots in which we wait for virtio-serial strings to also
check if QEMU was killed to provide a better error. E.g.:

    FAIL: pxe-install (bios + metal) (34.483s)
        QEMU unexpectedly exited while waiting awaiting completion: process killed

Related: coreos/fedora-coreos-tracker#1339
@dustymabe
Copy link
Member

ok now we see:

[2022-11-11T21:02:24.645Z] Ignoring verification of signature on metal image
[2022-11-11T21:04:15.969Z] PASS: pxe-install (bios + metal) (1m36.149s)
[2022-11-11T21:06:07.307Z] PASS: pxe-offline-install (bios + metal) (1m49.939s)
[2022-11-11T21:08:13.638Z] PASS: iso-offline-install (bios + metal) (2m6.52s)
[2022-11-11T21:08:35.480Z] PASS: iso-live-login (bios + metal) (21.395s)
[2022-11-11T21:08:50.283Z] PASS: iso-as-disk (bios + metal) (15.234s)
[2022-11-11T21:10:41.594Z] FAIL: miniso-install (bios + metal) (1m31.793s)
[2022-11-11T21:10:41.594Z]     QEMU unexpectedly exited while awaiting completion: process killed
[2022-11-11T21:10:41.594Z] Error: QEMU unexpectedly exited while awaiting completion: process killed
[2022-11-11T21:10:41.594Z] 2022-11-11T21:10:29Z cli: QEMU unexpectedly exited while awaiting completion: process killed
[2022-11-11T21:10:41.594Z] error: failed to execute cmd-kola: exit status 1
script returned exit code 1

from build#560

dustymabe added a commit to dustymabe/fedora-coreos-pipeline that referenced this issue Nov 11, 2022
We're seeing an obscure `Got EOF from completion channel` error
when running testISO. The running theory is that we're running
out of memory and that seems to be accurate in a test I did earlier
today. Let's bump for now and investigate when we have more time.

See coreos/fedora-coreos-tracker#1339
dustymabe added a commit to coreos/fedora-coreos-pipeline that referenced this issue Nov 11, 2022
We're seeing an obscure `Got EOF from completion channel` error
when running testISO. The running theory is that we're running
out of memory and that seems to be accurate in a test I did earlier
today. Let's bump for now and investigate when we have more time.

See coreos/fedora-coreos-tracker#1339
@jlebon
Copy link
Member

jlebon commented Nov 17, 2022

Let's optimistically close this as fixed by coreos/fedora-coreos-pipeline#750.

@jlebon jlebon closed this as completed Nov 17, 2022
@dustymabe
Copy link
Member

I think we still need to do some sort of analysis on why our previous calculations for max memory weren't accurate.

@jlebon
Copy link
Member

jlebon commented Nov 17, 2022

Agreed. Should we track that in a new issue against the pipeline instead?

@dustymabe
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira for syncing to jira kind/bug
Projects
None yet
Development

No branches or pull requests

4 participants