RoT reports timeout if not power-cycled after update #1451
Comments
I can't reproduce this using
I'm going to look into what |
Wicket is doing basically the same thing here. However, it may be doing it faster. After resetting the RoT, we need to sleep for ~3 seconds before it will reply; we'll see the
|
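The "sleep before it will reply" behaviour described above can be sketched as a bounded polling loop. This is only an illustration, not what wicket or faux-mgs actually do: `poll_until_ready`, the `query` closure, and the interval/deadline values are hypothetical stand-ins for whatever status request the caller sends to the RoT.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: call `query` repeatedly until it succeeds or
/// `deadline` has elapsed, sleeping `interval` between attempts.
/// `query` stands in for "ask the RoT for its status".
fn poll_until_ready<T, E>(
    mut query: impl FnMut() -> Result<T, E>,
    interval: Duration,
    deadline: Duration,
) -> Result<T, E> {
    let start = Instant::now();
    loop {
        match query() {
            // The device answered; stop polling.
            Ok(value) => return Ok(value),
            // Out of time; surface the last error to the caller.
            Err(e) if start.elapsed() >= deadline => return Err(e),
            // Still within the deadline; wait and try again.
            Err(_) => sleep(interval),
        }
    }
}
```

With a deadline comfortably above the ~3 seconds mentioned above, a caller would see a slow-but-healthy RoT as a success and only report a timeout if it never answers.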
I was able to reproduce this by putting I then decided to run
I then ran
|
Well, this is clearly not the same issue, because rebooting the RoT doesn't fix it. I have no idea how I corrupted the image. |
Even stranger, I try to reflash via humility and am seeing this:
|
AHA! What happened was that I ran the following:
The RoT update to slot B failed with a FlowError, but faux-mgs doesn't report that in the shell as an error, so I then forcefully switched to slot B, which was corrupted. The weird thing is that I reflashed slot A from humility, but it didn't appear to get booted into directly, even though B was corrupt. Maybe that's because it was just stuck after the kernel panic. |
Welp, a new wrinkle has arisen. I have removed the PEBKAC by not switching to the image I am purposefully corrupting. So I ran an install of slot A via faux-mgs, which failed. I then reset and ended up back in a kernel panic, same as above. However, slot B is valid and is the persistent boot preference. Even power-cycling by removing the power cable did not fix the issue. So I reflashed slot A via humility, and when the system came back it was in slot B. You can see this after I flashed slot A and did nothing else:
My best guess is that we are crashing because the romapi is not doing its job in |
I do have #1504 which uses our flash driver for checking if things are unprogrammed if you want to rule out the ROM API |
Oooh yeah, this could definitely overflow if we're using a corrupt image. I think the correct fix is to check the magic before trying to access the length, and also to switch to the checked APIs for doing math. |
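A minimal sketch of that idea, under stated assumptions: the header layout (4-byte little-endian magic followed by a 4-byte length), the `HEADER_MAGIC` placeholder value, and the names `ImageError` and `image_end` are all hypothetical and are not taken from the actual bootloader code. It only shows the ordering (magic first, then length) and the use of `checked_add` so that a corrupt or erased slot yields an error rather than an overflow.

```rust
/// Placeholder magic value for illustration; NOT the real header magic.
const HEADER_MAGIC: u32 = 0x1234_5678;

#[derive(Debug, PartialEq)]
enum ImageError {
    TooShort,
    BadMagic,
    LengthOverflow,
    OutOfBounds,
}

/// Read a little-endian u32 at `offset`, or None if out of range.
fn read_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    let b = buf.get(offset..end)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Hypothetical check: return the exclusive end address of the image
/// starting at `base`, refusing anything that extends past `slot_end`.
fn image_end(header: &[u8], base: u32, slot_end: u32) -> Result<u32, ImageError> {
    // 1. Validate the magic before trusting any other header field.
    let magic = read_u32_le(header, 0).ok_or(ImageError::TooShort)?;
    if magic != HEADER_MAGIC {
        return Err(ImageError::BadMagic);
    }
    // 2. Only now read the length, and do the address math with
    //    checked arithmetic so a bogus length cannot wrap around.
    let len = read_u32_le(header, 4).ok_or(ImageError::TooShort)?;
    let end = base.checked_add(len).ok_or(ImageError::LengthOverflow)?;
    if end > slot_end {
        return Err(ImageError::OutOfBounds);
    }
    Ok(end)
}
```

In this sketch, an erased slot (all `0xFF`) is rejected at the magic check, so its `0xFFFF_FFFF` "length" is never used in any arithmetic.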
Just commenting here that we see this in rack after doing an RoT-only reset via pilot: after doing this we get the timeouts:
and after a power cycle it's back to working:
|
Confirming @nathanaelhuffman's observation. During the mupdate of the PVT2 rack, I updated the RoTs of all 16 sleds, 1 psc, and 2 sidecars, and all 19 went into |
Testing on Simply resetting a locked RoT does not reproduce the issue, and resetting an unlocked RoT with a CFPA update staged doesn't either (see the very top of this thread). Once the RoT is in this bad state, sending a CS pulse (with Details in Matrix starting about here |
Fixed by #1518. The cause was that the standard system reset did not clear the flash lock registers, so we have to use a different reset. This was mentioned in one line of the 1229-page manual (I'm not bitter, not at all). |
After using wicket to update an RoT (from slot A to slot B), we observe that the RoT reports a timeout when attempting to read its status. After using ignition to power-cycle the whole board, it properly comes up as running from slot B.