Inconsistent first boot with Proxmox #9852
Comments
Seems I did this to myself, copy/pasta 🤦🏻‍♂️ 😠. To fix this, I changed this on the Terraform resource:
to this:
This might be fixed by #9810. You can inspect if any of the mounted disks contain a partition with a label |
Here is what I was able to get. Not seeing a partition with a label META at all 🤔 , at least not as seen in the docs Discovering Volumes.
Once it's there, it will fail to boot, so you won't see it on a successful run. Anyway, it's worth checking with Talos 1.9, which has a fix.
Out of curiosity, is there a way to inspect disks when in the looping state? Whatever issue I'm hitting must be unique to me as I don't see a lot of others hitting this issue in Proxmox like myself. Simply looking to better understand so that I can attempt to fix this issue on my end. Thanks! |
Try 1.9.0-beta.0 please. When it fails, you should see Talos discover |
Same issue with |
If that fails, can you please share full serial console logs of the boot process? On Proxmox it should be easy, but by default Talos serial console logs are disabled, so you might need to add to the extra kernel args (e.g. via Image Factory): |
For others to follow:
I'm totally lost right now as to how this can be. The disk in question is (most probably):
Looks like it's some kind of disk image with the GPT being smaller than the whole disk. Is this a Talos disk image? Talos finds the
But after that something else goes terribly wrong (need to look more to understand):
Doesn't fix anything, but the hope is that it would help with siderolabs#9852 and siderolabs#9786. Signed-off-by: Andrey Smirnov <[email protected]>
Doesn't fix anything, but the hope is that it would help with siderolabs#9852 and siderolabs#9786. Signed-off-by: Andrey Smirnov <[email protected]> (cherry picked from commit b15917e)
Some more details: in the Terraform config that I use to build the nodes, I use cloud-init to set the hostname and IP config. When using cloud-init, Proxmox leverages a cloud-init disk. Additionally, I specify an EFI disk (though not required). That said, the nodes have three disks (the one I specify for talos In an attempt to make progress, I removed the cloud-init config and EFI disk, and as a result the nodes started booting successfully. However, as I refined the IaC and after a few rebuilds (all successful, one after the next, with no changes to the IaC), the nodes began boot looping. At this point I suspect something with my infrastructure 🤷🏻. @smira, I noticed you added some debugging. How can I leverage this?
You can simply pull the console logs. Boot with |
Would this be with the latest v1.9.0 release? |
yes! |
Ok, took a while for the issue to come back. Below is the console output from v1.9.0. Thanks for taking a look.
Thanks for capturing this... My guess is that So if you keep trying to install Talos on non-empty disks, or on disks wiped in a bad way, too many fun things might happen. I'll take a look at whether we can mitigate that, but correctly wiping the disk on your side would be a solution.
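For others following along, here is a rough sketch of what "wiping both ends" of a disk could look like before reprovisioning. It is not how Talos wipes disks and not a Proxmox procedure; the function name `wipe_disk_ends` and the 1 MiB size are assumptions made for illustration. The idea is only that GPT keeps a primary copy at the start of the disk and a backup copy at the end, so zeroing just the beginning can leave a stale backup behind. Signatures located elsewhere on the disk (for example inside old partitions) are not touched by this sketch.

```python
import os

# Assumption: 1 MiB at each end is enough to cover the primary GPT
# (header + partition entries) and the backup GPT at the end of the disk.
WIPE_BYTES = 1024 * 1024


def wipe_disk_ends(path: str, wipe_bytes: int = WIPE_BYTES) -> None:
    """Zero the first and last `wipe_bytes` of a disk image or block device."""
    with open(path, "r+b") as disk:
        size = disk.seek(0, os.SEEK_END)

        # Zero the start of the disk (protective MBR, primary GPT).
        head = min(wipe_bytes, size)
        disk.seek(0)
        disk.write(b"\x00" * head)

        # Zero the end of the disk (backup GPT), without re-writing
        # the region already covered above on very small disks.
        tail_start = max(size - wipe_bytes, head)
        disk.seek(tail_start)
        disk.write(b"\x00" * (size - tail_start))

        disk.flush()
        os.fsync(disk.fileno())
```

Usage would be something like `wipe_disk_ends("/path/to/disk.img")` against a raw image or a block device, run only on a disk you intend to destroy.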
So from the log it's clear that the disk contains a smaller-than-expected GPT, but Talos (I need to verify that) falls back to a backup GPT (at the end of the disk), and that backup GPT is from Talos itself. And then there's a disagreement between the kernel-probed partition structure and what Talos assumes the partition structure is.
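To make the "smaller GPT plus a different backup GPT at the end" situation easier to picture, here is a minimal sketch that reads a raw disk image and compares the primary GPT header (LBA 1) with whatever header sits in the very last sector. It is not Talos's detection logic; the names `read_gpt_header` and `describe_gpt` are hypothetical, and 512-byte sectors are an assumption. The header layout and the CRC32 check follow the standard GPT header format.

```python
import struct
import zlib

SECTOR = 512  # assumption: 512-byte logical sectors
# Standard 92-byte GPT header: signature, revision, header size, header CRC32,
# reserved, current LBA, backup LBA, first/last usable LBA, disk GUID,
# partition entries LBA, entry count, entry size, entries CRC32.
GPT_HEADER = struct.Struct("<8sIIIIQQQQ16sQIII")


def read_gpt_header(disk, lba):
    """Return selected header fields at `lba`, or None if no valid header is there."""
    disk.seek(lba * SECTOR)
    raw = disk.read(SECTOR)
    if len(raw) < GPT_HEADER.size or raw[:8] != b"EFI PART":
        return None
    fields = GPT_HEADER.unpack_from(raw)
    header_size, header_crc = fields[2], fields[3]
    if not GPT_HEADER.size <= header_size <= len(raw):
        return None
    # The header CRC is computed with the CRC field itself set to zero.
    zeroed = raw[:16] + b"\x00" * 4 + raw[20:header_size]
    if zlib.crc32(zeroed) & 0xFFFFFFFF != header_crc:
        return None
    return {"current_lba": fields[5], "backup_lba": fields[6],
            "last_usable_lba": fields[8]}


def describe_gpt(path):
    """Compare the primary GPT header with the header found in the last sector."""
    with open(path, "rb") as disk:
        size = disk.seek(0, 2)
        last_lba = size // SECTOR - 1
        primary = read_gpt_header(disk, 1) if last_lba >= 1 else None
        at_end = read_gpt_header(disk, last_lba) if last_lba >= 1 else None
    return {
        "last_lba": last_lba,
        "primary": primary,
        "header_at_disk_end": at_end,
        # A primary header written for a smaller image points its backup
        # somewhere before the real end of the disk.
        "primary_covers_disk": primary is not None
        and primary["backup_lba"] == last_lba,
    }
```

If `primary_covers_disk` is false while `header_at_disk_end` is still a valid header, the disk holds two GPTs that disagree, which matches the kind of leftover state described above.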
I kept trying to reproduce this with any kind of broken GPT, but I still can't. If you have a chance to dump fully this |
1. Push GPT discovery last in the chain - if the GPT image is overridden with a smaller filesystem image, it might detect GPT before actual filesystem.
2. Fix GPT secondary header signature calculation - it might be not the end of the disk.
3. Add more tests for wipe by signatures and detection.
Related to, but doesn't fix siderolabs/talos#9852
Signed-off-by: Andrey Smirnov <[email protected]>
This wasn't supposed to be a fix yet. |
Can you clarify this or point me in a direction for how I'd do this? I don't need Proxmox-specific directions, just trying to understand what you're asking. Thanks!
I mean if you could do something like |
Bug Report
Description
I am building a Talos cluster (v1.8.3) on Proxmox (v8.2.2) with Terraform (v1.9.5). Sometimes, on first boot after provisioning, the machines boot to the expected Talos UI, and other times they boot and throw the error
error running phase 6 in initialize sequence: task 1/1: failed, unexpected EOF
followed by
failed to revert bootloader: unexpected EOF
then the system reboots and loops this error. I have completely destroyed the Terraform-managed infrastructure (i.e. Proxmox VM, disks, etc.) and retried with mixed results. Sometimes they boot as expected and other times they throw the previously mentioned error.
I've read in the following issues that I may need certain image extensions, so I have included i915-ucode, intel-ucode, and mei in my Image Factory generated Talos image, with the same error and boot loop behavior. I've confirmed that the id generated with talos_image_factory_schematic matches the id I received from the Image Factory when performed manually.
Observations
I am leveraging a nocloud image so that I may set an IP and Gateway IP address. When running a ping to one of the machines, on first boot the machine does not respond. On second boot right before it enters the rebooting countdown the machine does respond. Only mentioning this as I am unsure if my problem is related to something with nocloud / cloudinit.
IaC
Logs
Environment