pxe-offline test fails intermittently #1339
Comments
so it just appears to hang?
yes, it hangs after the read-disk task is complete.
I'm not very familiar with Rust or coreos-installer. It seems like
Saw this in bump-lockfile#250 today.
Observed the same failure in bump-lockfile last week too (build500). Looking at the console log, it seems to hang after finishing
Seen in bump-lockfile#257
I'm like 95% sure this is actually the QEMU process getting OOMkilled. Some clues are:
I started a patch to enhance cosa to better report when the test fails because QEMU died, but ran out of cycles for today. I'd say for now let's just bump the memory requests in
Right now when QEMU gets killed during a testiso run, the error is:

    FAIL: pxe-offline-install (bios + metal) (1m10.277s)
        Got EOF from completion channel, coreos-installer-test-OK expected

This is accurate but doesn't hint well enough at the underlying cause. Rework the two spots in which we wait for virtio-serial strings to also check if QEMU was killed to provide a better error. E.g.:

    FAIL: pxe-install (bios + metal) (34.483s)
        QEMU unexpectedly exited while waiting awaiting completion: process killed

Related: coreos/fedora-coreos-tracker#1339
Opened coreos/coreos-assembler#3192 which improves the error message in that case. If we merge that, we should be able to confirm that's what's happening before bumping memory requests.
Great find! It'll be nice to see some error messages come through. The lack of them has made this difficult to debug.
That would make sense. The console and journal logs stop at the same point each time, which suggests the memory limit is being exceeded consistently at the same place.
ok now we see:
from build#560
We're seeing an obscure `Got EOF from completion channel` error when running testISO. The running theory is that we're running out of memory and that seems to be accurate in a test I did earlier today. Let's bump for now and investigate when we have more time. See coreos/fedora-coreos-tracker#1339
Let's optimistically close this as fixed by coreos/fedora-coreos-pipeline#750.
I think we still need to do some sort of analysis on why our previous calculations for max memory weren't accurate.
Agreed. Should we track that in a new issue against the pipeline instead?
Describe the bug
The pxe-offline test fails intermittently in the testing-devel [x86_64] branch
Build failure - 472
Reproduction steps
Not reproducible. Occurs intermittently
Expected behavior
The coreos-installer-service is expected to write the Ignition config and first-boot kernel arguments so that the installation completes and the pxe-offline test passes.
Actual behavior
In the failing pxe-offline runs, the coreos-installer-service failed to write the Ignition config. The journal log also shows that the systemd-hostnamed service was deactivated after the coreos-installer-service completed the read-disk task.
System details
[testing-devel][x86_64] ⚡ 36.20221028.20.0
Ignition config
No response
Additional information
The coreos-installer-service stops after the Read-disk task is complete.
console.txt:
journal.txt: