Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RoT reports timeout if not power-cycled after update #1451

Closed
mkeeter opened this issue Jun 30, 2023 · 13 comments
Closed

RoT reports timeout if not power-cycled after update #1451

mkeeter opened this issue Jun 30, 2023 · 13 comments
Milestone

Comments

@mkeeter
Copy link
Collaborator

mkeeter commented Jun 30, 2023

After using wicket to update an RoT (from slot A to slot B), we observe that the RoT reports a timeout when attempting to read its status. After using ignition to power-cycle of the whole board, it properly comes up as running from slot B.

@mkeeter
Copy link
Collaborator Author

mkeeter commented Jul 5, 2023

I can't reproduce this using faux-mgs on the benchtop.

faux-mgs update rot 1 /Users/mjk/oxide/hubris/target/rot-carrier/dist/b/build-rot-carrier-image-b.zip
faux-mgs state # works fine
faux-mgs component-active-slot rot --set 1 --persist
faux-mgs state # still works fine, shows B as pending persistent boot preference
faux-mgs reset-component rot
faux-mgs state # shows B as active and persistent boot preference

I'm going to look into what wicket is actually doing here and see if it's different.

@mkeeter
Copy link
Collaborator Author

mkeeter commented Jul 10, 2023

Wicket is doing basically the same thing here. However, it may be doing it faster.

After resetting the RoT, we need to sleep for ~3 seconds before it will reply; we'll see the Timeout error message until then:

./target/debug/faux-mgs --interface en0 reset-component rot && ./target/debug/faux-mgs --interface en0 state
Jul 10 14:01:10.959 INFO creating SP handle on interface en0, component: faux-mgs
Jul 10 14:01:12.048 INFO initial discovery complete, addr: [fe80::c1d:ddff:fef8:fb69%15]:11111, interface: en0, component: faux-mgs
Jul 10 14:01:12.051 INFO SP is repared to reset component rot, component: faux-mgs
Jul 10 14:01:12.065 INFO SP reset component rot complete, component: faux-mgs
reset complete
Jul 10 14:01:12.081 INFO creating SP handle on interface en0, component: faux-mgs
Jul 10 14:01:12.086 INFO initial discovery complete, addr: [fe80::c1d:ddff:fef8:fb69%15]:11111, interface: en0, component: faux-mgs
Jul 10 14:01:12.132 INFO V2(SpStateV2 { hubris_archive_id: [75, 16, 227, 121, 150, 189, 78, 158], serial_number: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], model: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], revision: 0, base_mac_address: [14, 29, 221, 248, 251, 105], power_state: A2, rot: Err(Sprot(Timeout)) }), component: faux-mgs
hubris archive: 4b10e37996bd4e9e
serial number:
model:
revision: 0
base MAC address: 0e:1d:dd:f8:fb:69
power state: A2
RoT state: Err(Sprot(Timeout))

@andrewjstone
Copy link
Contributor

I was able to reproduce this by putting ping back in the rot-carrier image for some chaos monkey behavior while performing an update. I get an Err(Sprot(Timeout)) every time from faux-mgs. Interestingly, looking at the ringbuf, the RoT side of sprot was not getting any messages!

I then decided to run tasks on the rot-carrier and low and behold:

➜  hubris git:(master) ✗ rota tasks
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility tasks failed: kernel has panicked on boot: "panicked at 'attempt to add with overflow', lib/lpc55-rot-startup/src/images.rs:108:13"
➜  hubris git:(master) ✗ rotb tasks
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility tasks failed: kernel has panicked on boot: "panicked at 'attempt to add with overflow', lib/lpc55-rot-startup/src/images.rs:108:13"

rota and rotb are just humility aliases pointing to the correct images.

I then ran rota registers and saw the kernel has indeed panicked.

➜  hubris git:(master) ✗ rota registers
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
   R0 = 0x00000000
   R1 = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
   R2 = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
   R3 = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
   R4 = 0x00050000
   R5 = 0x00050130
   R6 = 0x00000000
   R7 = 0x20004278 <- kernel: 0x20004000+0x278
   R8 = 0x00000000
   R9 = 0x00050000
  R10 = 0x00000000
  R11 = 0x00000000
  R12 = 0x20004458 <- kernel: KERNEL_EPITAPH+0x4f
   SP = 0x20004248 <- kernel: 0x20004000+0x248
   LR = 0x00012451 <- kernel: write_str+0x35
   PC = 0x0001276c <- kernel: die_impl+0x44
  PSR = 0x69000000 <- 0110_1001_0000_0000_0000_0000_0000_0000
                      |||| | ||         |       |           |
                      |||| | ||         |       |           + Exception = 0x0
                      |||| | ||         |       +------------ IC/IT = 0x0
                      |||| | ||         +-------------------- GE = 0x0
                      |||| | |+------------------------------ T = 1
                      |||| | +------------------------------- IC/IT = 0x0
                      |||| +--------------------------------- Q = 1
                      |||+----------------------------------- V = 0
                      ||+------------------------------------ C = 1
                      |+------------------------------------- Z = 1
                      +-------------------------------------- N = 0

  MSP = 0x20004248 <- kernel: 0x20004000+0x248
  PSP = 0x00000000
  SPR = 0x00000000 <- 0000_0000_0000_0000_0000_0000_0000_0000
                            |||         |         |         |
                            |||         |         |         + PRIMASK = 0
                            |||         |         +---------- BASEPRI = 0x0
                            |||         +-------------------- FAULTMASK = 0
                            ||+------------------------------ CONTROL.nPRIV = 0
                            |+------------------------------- CONTROL.SPSEL = 0
                            +-------------------------------- CONTROL.FPCA = 0

FPSCR = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56

@andrewjstone
Copy link
Contributor

Well, this is clearly not the same issue, because rebooting the RoT doesn't fix it. I have no idea how I corrupted the image.

@andrewjstone
Copy link
Contributor

Well, this is clearly not the same issue, because rebooting the RoT doesn't fix it. I have no idea how I corrupted the image.

Even stranger, I try to reflash via humility and am seeing this:

➜  hubris git:(master) ✗ rota flash
humility: attaching with chip set to "LPC55S69JBD100"
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility flash failed: archive appears to be already flashed on attached device; use -F ("--force") to force re-flash or -V ("--verify") to verify contents
➜  hubris git:(master) ✗ rotb flash
humility: attaching with chip set to "LPC55S69JBD100"
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility flash failed: archive appears to be already flashed on attached device; use -F ("--force") to force re-flash or -V ("--verify") to verify contents

@andrewjstone
Copy link
Contributor

AHA! What happened was that I ran the following:

 target/release/faux-mgs --interface igb0 state && target/release/faux-mgs --interface igb0 update rot 1 ~/build-rot-carrier-image-b.zip && target/release/faux-mgs --interface igb0 component-active-slot -s 1 -p rot && target/release/faux-mgs --interface igb0 reset-component rot

The rot update to slot b failed with a FlowError. But faux-mgs doesn't report that in the shell as an error so then I switched to slot B forcefully but it was corrupted. The weird thing is that I reflash slot A from humility, but it didn't appear to get booted into directly, even though B was corrupt. Maybe that's because it was just stuck after the kernel panic.

@andrewjstone
Copy link
Contributor

Welp, A new wrinkle has arisen. I have removed the PEBKAC by not switching to the image I am purposefully corrupting. So I ran in an install via faux-mgs of slot A, which failed. I then reset and ended back up in a kernel panic, same as above. However, slot B is valid and the persistent boot preference. Power-cycling by removing the power cable did not fix the issue even. So I reflashed slot A via humility, and then when the system came back it was in slot B. You can see this after I flashed slot A and did nothing else:

➜  hubris git:(master) ✗ gimletlet hiffy -c SpRot.rot_boot_info
humility: attached to 0483:3754:005100383431511731343632 via ST-Link V3
SpRot.rot_boot_info() => RotBootInfo { active: B, persistent_boot_preference: B, pending_persistent_boot_preference: None, transient_boot_preference: None, slot_a_sha3_256_digest: Some([ 0x9b, 0x9b, 0xa0, 0x2a, 0x94, 0x49, 0xf9, 0x49, 0x5e, 0xd8, 0x34, 0x3f, 0xe7, 0x94, 0xa, 0x7e, 0x0, 0x99, 0x6c, 0x91, 0xe3, 0x3e, 0xaf, 0xcf, 0xa, 0xfd, 0x8b, 0xb9, 0x9d, 0x4, 0x36, 0x84 ]), slot_b_sha3_256_digest: Some([ 0x3, 0x98, 0x12, 0xd0, 0x5c, 0x8d, 0xc4, 0xfb, 0xd3, 0x13, 0x32, 0x85, 0x1, 0xa7, 0xbc, 0x9d, 0x4a, 0xb7, 0x50, 0x3c, 0x78, 0x1d, 0x69, 0xe5, 0xbd, 0x2f, 0x49, 0x25, 0x12, 0x88, 0x6a, 0xbc ]) }

My best guess is that we are crashing because the romapi is not doing it's job in lpc55-rot-startup and is returning a header as programmed when it isn't.

@labbott
Copy link
Collaborator

labbott commented Aug 25, 2023

I do have #1504 which uses our flash driver for checking if things are unprogrammed if you want to rule out the ROM API

@labbott
Copy link
Collaborator

labbott commented Aug 25, 2023

        let valid = lpc55_romapi::validate_programmed(
            img_start,
            (header.total_image_len + (PAGE_SIZE - 1)) & !(PAGE_SIZE - 1),
        );

Oooh yeah this could definitely overflow if we're using a corrupt image. I think the correct fix is to check the magic before trying to access the length and also switch to the check APIs for doing math.

@nathanaelhuffman
Copy link
Contributor

Just commenting here that we see this in rack after doing an RoT-only reset via pilot:
pilot sp exec -e 'reset-component rot' BRM27230001

after doing this we get the timeouts:

nathanael@drteeth ~ $ pilot sp info  BRM27230001
hubris archive: 6cb4ed394fc7b36c
serial number: BRM27230001
model: 913-0000019
revision: 13
base MAC address: a8:40:25:04:05:c1
power state: A0
RoT state: Err(Sprot(Timeout))

and after a power cycle it's back to working:

nathanael@drteeth ~ $ pilot sp info  BRM27230001                                                  hubris archive: 6cb4ed394fc7b36c
serial number: BRM27230001
model: 913-0000019
revision: 13
base MAC address: a8:40:25:04:05:c1
power state: A0
RoT state: Ok(RotStateV2 { active: B, persistent_boot_preference: B, pending_persistent_boot_preference: None, transient_boot_preference: None, slot_a_sha3_256_digest: Some([225, 90, 3, 166, 133, 26, 105, 129, 123, 135, 6, 55, 26, 48, 101, 113, 101, 162, 203, 199, 162, 151, 250, 188, 164, 159, 205, 201, 142, 227, 76, 46]), slot_b_sha3_256_digest: Some([246, 2, 155, 45, 220, 12, 152, 100, 177, 189, 42, 86, 113, 96, 65, 26, 74, 86, 216, 206, 179, 138, 198, 145, 18, 213, 189, 132, 235, 7, 140, 213]) })

@morlandi7 morlandi7 added this to the MVP milestone Aug 29, 2023
@jgallagher
Copy link
Contributor

Confirming @nathanaelhuffman's observation. During the mupdate of the PVT2 rack, I updated the RoTs of all 16 sleds, 1 psc, and 2 sidecars, and all 19 went into Sprot(Timeout) following the reset-after-update. Ignition power cycling each component brought it back (and the RoT booted into the slot we expected following the update).

@mkeeter
Copy link
Collaborator Author

mkeeter commented Sep 1, 2023

Testing on animal and a spare locked Gimlet, I can get this to happen 100% of the time by resetting a locked RoT (SpRot.reset) with a CFPA update staged (i.e. we have written a new CFPA to the scratch page and are waiting for the ROM to copy it over, using SpRot.switch_default_image -aslot=A,duration=Forever).

Simply resetting a locked RoT does not reproduce the issue, and resetting an unlocked RoT with a CFPA update staged doesn't either (see the very top of this thread).

Once the RoT is in this bad state, sending a CS pulse (with SpRot.pulse_cs -adelay=10) does not recover it. This lends credence to the theory that it's not making it into Hubris at all.

Details in Matrix starting about here

@labbott
Copy link
Collaborator

labbott commented Sep 14, 2023

Fixed by #1518. Cause was that the standard system reset did not clear the flash lock registers so we have to use a different reset. This was mentioned in one line of the 1229 page manual (im not bitter not at all).

@labbott labbott closed this as completed Sep 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants