SP-RoT link ends up in a persistent bad state #1507

Closed · mkeeter opened this issue Aug 23, 2023 · 21 comments

@mkeeter (Collaborator) commented Aug 23, 2023

While updating the colo rack with the dogfood image (https://github.com/oxidecomputer/omicron/runs/15999510273), we updated both the SP and RoT images using mupdate / wicket.

(Normally, we skip updating the RoTs during mupdate, because they haven't changed in a while)

Augustus noticed one RoT showing a failure to communicate after the initial update; a second update brought this number to three RoTs.

Here's an example of an unhappy RoT: [screenshot]

For contrast, a happy RoT: [screenshot]

The ringbuf shows invalid data received by the SP when we try to request the RoT status:

humility: ring buffer drv_stm32h7_sprot_server::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
   0  241        1        1 Sent(0xa)
   1  283        1        1 Received(0x8)
   2  437        1        1 RxBuf([ 0x1, 0x0, 0x2, 0xe4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
   3  447        1        1 Error(Protocol(InvalidCrc))
   4  241        1        1 Sent(0xa)
   5  283        1        1 Received(0x8)
   6  437        1        1 RxBuf([ 0x1, 0x0, 0x2, 0xe4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
   7  447        1        1 Error(Protocol(InvalidCrc))
   8  241        1        1 Sent(0xa)
   9  283        1        1 Received(0x8)
  10  437        1        1 RxBuf([ 0x1, 0x0, 0x2, 0xe4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  11  447        1        1 Error(Protocol(InvalidCrc))
  12  457        1        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }

We would expect the RxBuf messages to begin with [3, 0, 0, 0], because that's the current protocol version serialized with u32::to_le_bytes().
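
For reference, that expected prefix falls straight out of the version serialization (a standalone Rust check, not the sprot code itself):

// A healthy sprot frame should start with the protocol version (currently 3)
// encoded as a little-endian u32, i.e. the bytes [3, 0, 0, 0].
fn main() {
    let version: u32 = 3;
    assert_eq!(version.to_le_bytes(), [3, 0, 0, 0]);
}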

Looking at the spi_server_core ringbuf, we do see that pattern:

humility: ring buffer drv_stm32h7_spi_server_core::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
  50  494        3        2 WaitISR(0x70012)
  51  464        3        1 Rx(0x0)
  52  494        3        2 WaitISR(0x60012)
  53  464        3        1 Rx(0x3)
  54  494        3        2 WaitISR(0x50012)
  55  464        3        1 Rx(0x0)
  56  494        3        2 WaitISR(0x40012)
  57  464        3        1 Rx(0x0)
  58  494        3        2 WaitISR(0x30012)
  59  464        3        1 Rx(0x0)
  60  494        3        2 WaitISR(0x20012)
  61  464        3        1 Rx(0x3)
  62  494        3        2 WaitISR(0x10012)
  63  464        3        1 Rx(0x0)
   0  322        4        1 Start(read, (0x0, 0x10))
   1  430        4        8 Tx(0x0)
   2  494        4        2 WaitISR(0x100000)
   3  464        4        1 Rx(0x1)
   4  430        4        1 Tx(0x0)
   5  494        4        1 WaitISR(0xf0000)
   6  494        4        1 WaitISR(0xf0002)
   7  464        4        1 Rx(0x0)
   8  430        4        1 Tx(0x0)
   9  494        4        1 WaitISR(0xe0000)
  10  494        4        1 WaitISR(0xe0002)
  11  464        4        1 Rx(0x2)
  12  430        4        1 Tx(0x0)
  13  494        4        1 WaitISR(0xd0000)
  14  494        4        1 WaitISR(0xd0002)
  15  464        4        1 Rx(0xe4)
  16  430        4        1 Tx(0x0)
  17  494        4        1 WaitISR(0xc0000)
  18  494        4        1 WaitISR(0xc0002)
  19  464        4        1 Rx(0x0)
  20  430        4        1 Tx(0x0)
  21  494        4        1 WaitISR(0xb0000)
  22  494        4        1 WaitISR(0xb0002)
  23  464        4        1 Rx(0x0)
  24  430        4        1 Tx(0x0)
  25  494        4        1 WaitISR(0xa0000)
  26  494        4        1 WaitISR(0xa0002)
  27  464        4        1 Rx(0x0)
  28  430        4        1 Tx(0x0)
  29  494        4        1 WaitISR(0x90000)
  30  494        4        1 WaitISR(0x90002)
  31  464        4        1 Rx(0x0)
  32  430        4        1 Tx(0x0)
  33  494        4        1 WaitISR(0x80010)
  34  494        4        1 WaitISR(0x80012)
  35  464        4        1 Rx(0x0)
  36  494        4        2 WaitISR(0x70012)
  37  464        4        1 Rx(0x0)
  38  494        4        2 WaitISR(0x60012)
  39  464        4        1 Rx(0x0)
  40  494        4        2 WaitISR(0x50012)
  41  464        4        1 Rx(0x0)
  42  494        4        2 WaitISR(0x40012)
  43  464        4        1 Rx(0x0)
  44  494        4        2 WaitISR(0x30012)
  45  464        4        1 Rx(0x0)
  46  494        4        2 WaitISR(0x20012)
  47  464        4        1 Rx(0x0)
  48  494        4        2 WaitISR(0x10012)
  49  464        4        1 Rx(0x0)

However, it occurs before the Start log in the ringbuf! That's weird and surprising. Because we're retrying three times, it's likely from a previous message; unfortunately, the ringbuf isn't long enough to record all three attempts.

This looks like a framing issue where the SP and RoT have gotten out of sync. There are FIFO buffers on each side's SPI peripheral. Resetting the SP did not fix the issue; power-cycling the whole sled (via Ignition) does fix the issue. This leads us to suspect the RoT's Tx FIFO.

(In particular, I'm wondering what happens if the RoT begins listening to SPI midway through an attempted transaction from the SP, since the SP will try talking to it unconditionally)

@jgallagher (Contributor) commented:

Possibly related, although a different error message: #1451

@mkeeter (Collaborator, Author) commented Aug 23, 2023

Yeah, thanks for linking that!

I'm not sure they're related: #1451 is a timeout (the RoT never asserts ROT_TO_SP_IRQ_L), whereas here the RoT is sending corrupt data over SPI.

On the other hand, if we're blaming FIFOs for framing issues, we could hypothesize

  • This issue is caused by FIFO shenanigans in the RoT → SP direction, so the SP sees corrupt messages
  • RoT reports timeout if not power-cycled after update #1451 is caused by FIFO shenanigans in the SP → RoT direction, so the RoT sees corrupt messages (and never pulls the IRQ line low, because it can't reply)

(This is extremely hand-wavey / unjustified, e.g. I can't remember if the RoT replies with an error on a corrupt message, but just putting it out there)

@lzrd (Contributor) commented Aug 23, 2023

Re "This looks like a framing issue where the SP and RoT have gotten out of sync. ..."

SPI messages are framed by the SPI bus CSn signal.

@lzrd (Contributor) commented Aug 23, 2023

Note: when the SP does a flush (SPI CSn deassert, 10 ms delay with no SPI clocks, CSn assert), SP and RoT communications should be re-synchronized if they had ever fallen out of sync.

ajs and I don't see anything wrong there yet.

@lzrd (Contributor) commented Aug 24, 2023

Is there any chance that the LPC55S69 parts are programmed to operate at 48 MHz instead of the required 96 MHz? I don't think that's likely, but it could cause intermittent failures. We need to double-check the CMPA programming.

@lzrd (Contributor) commented Aug 28, 2023

I did some SysTick (1ms) resolution ringbuf logging that doesn't immediately shed any light.

humility: ring buffer drv_lpc55_sprot_server::TIMES in sprot:
humility: attached to 1fc9:0143:MMLBHACB24IDJ via CMSIS-DAP
 NDX LINE      GEN    COUNT PAYLOAD

  39  185       89        1 Now(0x34310) // top of loop
  40  231       89        1 Now(0x34310) // wait_for_csn_asserted(false), notified.
  41  264       89        1 Now(0x34318) // wait_for_request(): read some, ssd
  42  276       89        1 Now(0x34318) // wait_for_request(): read remainder
  43  287       89        1 Now(0x34318) // received SP's message
  44  188       89        1 Now(0x34318) // message ok, sending to handler
  45  207       89        1 Now(0x3431b) // message handled
  46  209       89        1 Now(0x3431b) // ringbuf_entry!(Trace::Stats done
  47  311       89        1 Now(0x3431b) // reply: beginning reply
  48  231       89        2 Now(0x3431b) // wait_for_csn_asserted(true), ROT_IRQ, then notified.
  49  322       89        1 Now(0x3431e) // reply: Tx loop done
  50  352       89        1 Now(0x3431e) // reply: FIFOs ready for next request.
  51  185       89        1 Now(0x3431e) // top of loop
  52  231       89        1 Now(0x3431e) // wait_for_csn_asserted

@andrewjstone (Contributor) commented:

I was unable to reproduce on madrid on Tuesday; however, today I reproduced it a single time on my gimletlet and rot-carrier. Reproduction here means that we get an endless stream of CrcError responses from faux-mgs state and the system never self-corrects. Here are my raw notes from tonight's debugging session:


Reproduced on the gimletlet/rot-carrier with Keeter's recommendation

Spam faux-mgs state requests and reset RoT via debugger

Specifically, this command was run on my helios box:

ajs@tarz:~/management-gateway-service$ while true; do target/release/faux-mgs --interface igb0 state && sleep 0.1; done

Then on my mac: rota reset

Here's the relevant RoT ringbuf:

➜  hubris git:(hubris-1507-debugging) ✗ rota ringbuf
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility: ring buffer drv_lpc55_sprot_server::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
  27  356       66        1 Underrun
  28  316       66        1 SentBytes(0x0)
  29  267       66        1 ReceivedBytes(0x10)
  30   91       66        1 Err(UnsupportedProtocol)
  31  194       66        1 Stats(RotIoStats { rx_received: 0x15f, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x15f, rx_invalid: 0x15f, tx_incomplete: 0x0 })
   0  277       67        1 ReplyLen(0xb)
   1  356       67        1 Underrun
   2  316       67        1 SentBytes(0x0)
   3  267       67        1 ReceivedBytes(0x10)
   4   91       67        1 Err(UnsupportedProtocol)
   5  194       67        1 Stats(RotIoStats { rx_received: 0x160, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x160, rx_invalid: 0x160, tx_incomplete: 0x0 })
   6  277       67        1 ReplyLen(0xb)
   7  356       67        1 Underrun
   8  316       67        1 SentBytes(0x0)
   9  267       67        1 ReceivedBytes(0x10)
  10   91       67        1 Err(UnsupportedProtocol)
  11  194       67        1 Stats(RotIoStats { rx_received: 0x161, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x161, rx_invalid: 0x161, tx_incomplete: 0x0 })
  12  277       67        1 ReplyLen(0xb)
  13  356       67        1 Underrun
  14  316       67        1 SentBytes(0x0)
  15  267       67        1 ReceivedBytes(0x10)
  16   91       67        1 Err(UnsupportedProtocol)
  17  194       67        1 Stats(RotIoStats { rx_received: 0x162, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x162, rx_invalid: 0x162, tx_incomplete: 0x0 })
  18  277       67        1 ReplyLen(0xb)
  19  356       67        1 Underrun
  20  316       67        1 SentBytes(0x0)
  21  267       67        1 ReceivedBytes(0x10)
  22   91       67        1 Err(UnsupportedProtocol)
  23  194       67        1 Stats(RotIoStats { rx_received: 0x163, rx_overrun: 0x2, csn_pulses: 0x0, tx_underrun: 0x163, rx_invalid: 0x163, tx_incomplete: 0x0 })
  24  277       67        1 ReplyLen(0xb)
  25  356       67        1 Underrun
  26  316       67        1 SentBytes(0x0)

What's odd is that the RoT consistently hits UnsupportedProtocol because of an overrun, and returns that error. The first 6 bytes of the reply come back to the SP, but the remaining 5 are 0, resulting in a CRC error. This is the same on every iteration.

It's like the RoT is stuck and not actually mutating its buffers.

Here's the SP state for the last retry. The 0x17.. are timestamps in ticks (ms).


 22  596      276        1 ReqRotBootInfo
  23  218      276        1 DoSendRecv(0x17a1e5)
  24  346      276        1 RotIrqNotAsserted
  25  262      276        1 Sent(0xa)
  26  399      276        1 WaitIrqStarted(true, 0x17a1e5)
  27  413      276        1 WaitIrqStopped(true, 0x17a1e7)
  28  304      276        1 Received(0xb)
  29  470      276        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  30  480      276        1 Error(Protocol(InvalidCrc))
  31  218      276        1 DoSendRecv(0x17a1f0)
  32  346      276        1 RotIrqNotAsserted
  33  262      276        1 Sent(0xa)
  34  399      276        1 WaitIrqStarted(true, 0x17a1f0)
  35  413      276        1 WaitIrqStopped(true, 0x17a1f2)
  36  304      276        1 Received(0xb)
  37  470      276        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  38  480      276        1 Error(Protocol(InvalidCrc))
  39  218      276        1 DoSendRecv(0x17a1fb)
  40  346      276        1 RotIrqNotAsserted
  41  262      276        1 Sent(0xa)
  42  399      276        1 WaitIrqStarted(true, 0x17a1fb)
  43  413      276        1 WaitIrqStopped(true, 0x17a1fd)
  44  304      276        1 Received(0xb)
  45  470      276        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  46  480      276        1 Error(Protocol(InvalidCrc))
  47  490      276        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }

The first 6 bytes indicate an error response for UnsupportedProtocol at
protocol version 3. However, the remaining bytes aren't filled in. The last 2
bytes should be the CRC, but they are 0.

I hooked up the Saleae and I can see that the RoT is raising ROT_IRQ before it
sees CSn de-asserted. It looks like this happens at byte 9 of the response
from the RoT, after which 6 more bytes get clocked out. Interestingly, this would
indeed make the checksum bytes 0, which they are on the wire, resulting in the
SP seeing a CRC error. However, it's still unclear to me why the RoT thinks
it's seeing an UnsupportedProtocol; the message from the SP on the wire is correct.

One thing that could explain both of these symptoms (the early ROT_IRQ raise and
the invalid read) is CSn de-assertion being misdetected or tracked incorrectly
in a variable. That's what I'll be looking into in the code next.

We also see that the RoT is returning 0 bytes on reply and detecting an underrun. It's
also receiving 16 bytes, which should only occur when the SP is waiting for a reply. So clearly
there is a persistent desynchronization, but the SP should also only wait for a reply
when ROT_IRQ is asserted, which I don't see in the trace. However, it looks like
we don't de-assert ROT_IRQ at RoT startup, and that could cause issues.

Going to add some more debug traces to the RoT side and try to reproduce again.


Having a hard time reproducing again with my added ringbuf traces on the RoT. So
let's do some speculating.

We know from the RoT ringbuf that the RoT receives 16 bytes while
waiting for a request. It sees all 16 bytes and decodes them as
SprotError::UnsupportedProtocol. We can assume that this is the SP trying to get
a reply to its prior message, as the SP always tries to read a full FIFO of 16
bytes, and while waiting for a reply the SP clocks out zeroes, which the RoT would
interpret as UnsupportedProtocol. However, the SP will only try
to read a reply if it sees ROT_IRQ raised. So the RoT must be asserting ROT_IRQ,
even though it's waiting for a request from the SP and should have ROT_IRQ
de-asserted during that time.

We also know that the RoT, when trying to reply with this error, had not clocked
out more than its FIFO before CSn was de-asserted, as shown by the misleading
Trace::SentBytes(0). This implies that as soon as CSn was asserted and the IRQ
fired for the reply, CSn had also already become de-asserted.

We then notice in the RoT reply code that we drain the FIFO and de-assert
ROT_IRQ. At this point we should go back to waiting for the next message in
wait_for_request. But the de-assert of ROT_IRQ shows up "early" at the SP and
the SP detects a CRC error. The SP then retries the send after 5 ms; the RoT sees an UnsupportedProtocol,
raises ROT_IRQ, and replies while the SP thinks it is still sending; then the SP goes to receive and gets back partial bytes.

So maybe this is purely just a timing rhythm that kicks off when the RoT starts
at a particular point in time. Here are the hypothesized steps.

  1. The SP sends a status message. It has clocked out most of the bytes, and then de-asserts CSn.
  2. The RoT wakes up and waits for CSn to be asserted by waiting for an IRQ.
  3. Some time passes; the RoT gets woken up, sees that CSn is asserted, then goes into a tight loop to read the message.
  4. The RoT only clocks in a partial message (the last byte or so of the SP request).
  5. The RoT sees that CSn is already de-asserted.
  6. The RoT tries to deserialize the request, sees that it is invalid, then calls reply.
  7. By this point, the SP's retry timer has fired, causing it to start clocking out the next request.
  8. The RoT raises ROT_IRQ, then waits on an IRQ for CSn to be asserted.
  9. CSn is set in the interrupt register on the RoT and the RoT goes to reply.
  10. The RoT only clocks a few bytes out of its reply, because it is actually receiving the remainder of a partial SP message and sees, via the ssd bit, that CSn is also de-asserted.
  11. The RoT de-asserts ROT_IRQ and goes to wait for the next request.
  12. The SP gets a short reply with a CRC error, then goes to retry.
  13. The cycle repeats.

This is plausible because we see an overrun in the RoT logs indicating a short
read from the SP on every retry, we saw ROT_IRQ trigger early on the
Saleae trace, and we know that only partial replies are coming back to the SP,
even though they always start with the right header bytes.

It seems that this is all due to the retry timeout being 5 ms and us hitting
it repeatedly, so a poorly chosen timeout results in a loop. This is deeply
unsatisfying, as it's very hard to prove that this is really what's happening.
It also doesn't explain why, even after the retries are exhausted, the system
never comes back and instead continues with the same behavior on the next
attempt. Regardless, there are a few things we should do:

  • Raise the retry timeout.
  • If the RoT receives a bad message from the SP (flow error, unsupported
    protocol, etc.), the RoT should NOT send a reply; that just wastes time on
    something that will end in a retry anyway. Instead the RoT should go back to
    waiting for the next message and let the SP time out (see the sketch below).
    This will also have the side effect of simplifying the retry error handling
    on the SP, as only "unretryable" errors will be returned.
  • Initialize ROT_IRQ to de-asserted on RoT sprot startup. There is no evidence
    that this is an issue, but it should be done.
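
To make the second bullet concrete, here's a rough sketch of the proposed RoT main loop (the wait_for_request / reply / handler names follow this thread; the exact types, error variants, and signatures are assumptions, not the real lpc55-sprot-server code):

// Sketch only: on a protocol-level receive error (flow error, unsupported
// protocol, bad CRC, ...), send nothing and go straight back to listening;
// the SP's retry timeout handles the rest.
loop {
    match io.wait_for_request(&mut rx_buf) {
        Ok(request) => {
            let response = handler.handle(request);
            // reply() asserts ROT_IRQ, clocks out the response, then
            // de-asserts ROT_IRQ once the transaction is over.
            io.reply(&response);
        }
        Err(_protocol_error) => {
            // Stay quiet: replying here only burns time on an exchange
            // that will end in an SP-side retry anyway.
            continue;
        }
    }
}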

Now, to help clarify, I'd like to get a reproduction with a Saleae trace. I'm
going to remove my RoT changes and try again, recording the whole thing from the
start of the SP spam prior to RoT reset.

@andrewjstone (Contributor) commented:

OK, reproduced again. Interestingly, a closer look shows that there are a few UnsupportedProtocol errors reported from the SP to faux-mgs, but we are definitely stuck. This time I got the whole thing recorded on the Saleae. I need to figure out where the first error after the timeout starts; I will check that out in the morning. I attached the trace:
SprotInvalidCrc.sal.zip

Here's the full gimletlet ringbuf

➜  hubris git:(hubris-1507-debugging) ✗ gimletlet ringbuf sprot
humility: attached to 0483:3754:005100383431511731343632 via ST-Link V3
humility: ring buffer drv_stm32h7_spi_server_core::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
   7  430     1993        1 Tx(0x0)
   8  430     1993        1 Tx(0x3)
   9  430     1993        1 Tx(0x7)
  10  494     1993        2 WaitISR(0xa0002)
  11  464     1993        1 Rx(0x0)
  12  430     1993        1 Tx(0xa9)
  13  494     1993        2 WaitISR(0x90002)
  14  464     1993        1 Rx(0x0)
  15  430     1993        1 Tx(0x0)
  16  494     1993        2 WaitISR(0x80012)
  17  464     1993        1 Rx(0x0)
  18  494     1993        2 WaitISR(0x70012)
  19  464     1993        1 Rx(0x0)
  20  494     1993        2 WaitISR(0x60012)
  21  464     1993        1 Rx(0x0)
  22  494     1993        2 WaitISR(0x50012)
  23  464     1993        1 Rx(0x0)
  24  494     1993        2 WaitISR(0x40012)
  25  464     1993        1 Rx(0x0)
  26  494     1993        2 WaitISR(0x30012)
  27  464     1993        2 Rx(0x0)
  28  494     1993        2 WaitISR(0x10012)
  29  464     1993        1 Rx(0x0)
  30  322     1993        1 Start(read, (0x0, 0x10))
  31  430     1993        8 Tx(0x0)
  32  494     1993        2 WaitISR(0x100002)
  33  464     1993        1 Rx(0x3)
  34  430     1993        1 Tx(0x0)
  35  494     1993        2 WaitISR(0xf0002)
  36  464     1993        1 Rx(0x0)
  37  430     1993        1 Tx(0x0)
  38  494     1993        3 WaitISR(0xe0002)
  39  464     1993        1 Rx(0x0)
  40  430     1993        1 Tx(0x0)
  41  494     1993        1 WaitISR(0xd0002)
  42  464     1993        1 Rx(0x0)
  43  430     1993        1 Tx(0x0)
  44  494     1993        2 WaitISR(0xc0002)
  45  464     1993        1 Rx(0x3)
  46  430     1993        1 Tx(0x0)
  47  494     1993        1 WaitISR(0xb0002)
  48  464     1993        1 Rx(0x0)
  49  430     1993        1 Tx(0x0)
  50  494     1993        1 WaitISR(0xb0002)
  51  494     1993        2 WaitISR(0xa0002)
  52  464     1993        1 Rx(0x0)
  53  430     1993        1 Tx(0x0)
  54  494     1993        1 WaitISR(0x90002)
  55  464     1993        1 Rx(0x0)
  56  430     1993        1 Tx(0x0)
  57  494     1993        2 WaitISR(0x80012)
  58  464     1993        1 Rx(0x0)
  59  494     1993        2 WaitISR(0x70012)
  60  464     1993        1 Rx(0x0)
  61  494     1993        2 WaitISR(0x60012)
  62  464     1993        1 Rx(0x0)
  63  494     1993        2 WaitISR(0x50012)
   0  464     1994        2 Rx(0x0)
   1  494     1994        2 WaitISR(0x30012)
   2  464     1994        1 Rx(0x0)
   3  494     1994        2 WaitISR(0x20012)
   4  464     1994        1 Rx(0x0)
   5  494     1994        2 WaitISR(0x10012)
   6  464     1994        1 Rx(0x0)
humility: ring buffer drv_stm32h7_sprot_server::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
 244  399       57        1 WaitIrqStarted(true, 0x1b3d0)
 245  413       57        1 WaitIrqStopped(true, 0x1b3d2)
 246  304       57        1 Received(0xb)
 247  470       57        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 248  480       57        1 Error(Protocol(InvalidCrc))
 249  218       57        1 DoSendRecv(0x1b3db)
 250  346       57        1 RotIrqNotAsserted
 251  262       57        1 Sent(0xa)
 252  399       57        1 WaitIrqStarted(true, 0x1b3db)
 253  413       57        1 WaitIrqStopped(true, 0x1b3dd)
 254  304       57        1 Received(0xb)
 255  470       57        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
   0  480       58        1 Error(Protocol(InvalidCrc))
   1  218       58        1 DoSendRecv(0x1b3e6)
   2  346       58        1 RotIrqNotAsserted
   3  262       58        1 Sent(0xa)
   4  399       58        1 WaitIrqStarted(true, 0x1b3e6)
   5  413       58        1 WaitIrqStopped(true, 0x1b3e8)
   6  304       58        1 Received(0xb)
   7  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
   8  480       58        1 Error(Protocol(InvalidCrc))
   9  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
  10  596       58        1 ReqRotBootInfo
  11  218       58        1 DoSendRecv(0x1b456)
  12  346       58        1 RotIrqNotAsserted
  13  262       58        1 Sent(0xa)
  14  399       58        1 WaitIrqStarted(true, 0x1b456)
  15  413       58        1 WaitIrqStopped(true, 0x1b458)
  16  304       58        1 Received(0xb)
  17  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  18  480       58        1 Error(Protocol(InvalidCrc))
  19  218       58        1 DoSendRecv(0x1b461)
  20  346       58        1 RotIrqNotAsserted
  21  262       58        1 Sent(0xa)
  22  399       58        1 WaitIrqStarted(true, 0x1b461)
  23  413       58        1 WaitIrqStopped(true, 0x1b463)
  24  304       58        1 Received(0xb)
  25  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  26  480       58        1 Error(Protocol(InvalidCrc))
  27  218       58        1 DoSendRecv(0x1b46c)
  28  346       58        1 RotIrqNotAsserted
  29  262       58        1 Sent(0xa)
  30  399       58        1 WaitIrqStarted(true, 0x1b46c)
  31  413       58        1 WaitIrqStopped(true, 0x1b46e)
  32  304       58        1 Received(0xb)
  33  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  34  480       58        1 Error(Protocol(InvalidCrc))
  35  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
  36  596       58        1 ReqRotBootInfo
  37  218       58        1 DoSendRecv(0x1b4dc)
  38  346       58        1 RotIrqNotAsserted
  39  262       58        1 Sent(0xa)
  40  399       58        1 WaitIrqStarted(true, 0x1b4dc)
  41  413       58        1 WaitIrqStopped(true, 0x1b4de)
  42  304       58        1 Received(0xb)
  43  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  44  480       58        1 Error(Protocol(InvalidCrc))
  45  218       58        1 DoSendRecv(0x1b4e7)
  46  346       58        1 RotIrqNotAsserted
  47  262       58        1 Sent(0xa)
  48  399       58        1 WaitIrqStarted(true, 0x1b4e7)
  49  413       58        1 WaitIrqStopped(true, 0x1b4e9)
  50  304       58        1 Received(0xb)
  51  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  52  480       58        1 Error(Protocol(InvalidCrc))
  53  218       58        1 DoSendRecv(0x1b4f2)
  54  346       58        1 RotIrqNotAsserted
  55  262       58        1 Sent(0xa)
  56  399       58        1 WaitIrqStarted(true, 0x1b4f2)
  57  413       58        1 WaitIrqStopped(true, 0x1b4f4)
  58  304       58        1 Received(0xb)
  59  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  60  480       58        1 Error(Protocol(InvalidCrc))
  61  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
  62  596       58        1 ReqRotBootInfo
  63  218       58        1 DoSendRecv(0x1b562)
  64  346       58        1 RotIrqNotAsserted
  65  262       58        1 Sent(0xa)
  66  399       58        1 WaitIrqStarted(true, 0x1b562)
  67  413       58        1 WaitIrqStopped(true, 0x1b564)
  68  304       58        1 Received(0xb)
  69  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  70  480       58        1 Error(Protocol(InvalidCrc))
  71  218       58        1 DoSendRecv(0x1b56d)
  72  346       58        1 RotIrqNotAsserted
  73  262       58        1 Sent(0xa)
  74  399       58        1 WaitIrqStarted(true, 0x1b56d)
  75  413       58        1 WaitIrqStopped(true, 0x1b56f)
  76  304       58        1 Received(0xb)
  77  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  78  480       58        1 Error(Protocol(InvalidCrc))
  79  218       58        1 DoSendRecv(0x1b578)
  80  346       58        1 RotIrqNotAsserted
  81  262       58        1 Sent(0xa)
  82  399       58        1 WaitIrqStarted(true, 0x1b578)
  83  413       58        1 WaitIrqStopped(true, 0x1b57a)
  84  304       58        1 Received(0xb)
  85  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  86  480       58        1 Error(Protocol(InvalidCrc))
  87  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
  88  596       58        1 ReqRotBootInfo
  89  218       58        1 DoSendRecv(0x1b5e8)
  90  346       58        1 RotIrqNotAsserted
  91  262       58        1 Sent(0xa)
  92  399       58        1 WaitIrqStarted(true, 0x1b5e8)
  93  413       58        1 WaitIrqStopped(true, 0x1b5ea)
  94  304       58        1 Received(0xb)
  95  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
  96  480       58        1 Error(Protocol(InvalidCrc))
  97  218       58        1 DoSendRecv(0x1b5f3)
  98  346       58        1 RotIrqNotAsserted
  99  262       58        1 Sent(0xa)
 100  399       58        1 WaitIrqStarted(true, 0x1b5f3)
 101  413       58        1 WaitIrqStopped(true, 0x1b5f5)
 102  304       58        1 Received(0xb)
 103  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 104  480       58        1 Error(Protocol(InvalidCrc))
 105  218       58        1 DoSendRecv(0x1b5fe)
 106  346       58        1 RotIrqNotAsserted
 107  262       58        1 Sent(0xa)
 108  399       58        1 WaitIrqStarted(true, 0x1b5fe)
 109  413       58        1 WaitIrqStopped(true, 0x1b600)
 110  304       58        1 Received(0xb)
 111  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 112  480       58        1 Error(Protocol(InvalidCrc))
 113  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
 114  596       58        1 ReqRotBootInfo
 115  218       58        1 DoSendRecv(0x1b66e)
 116  346       58        1 RotIrqNotAsserted
 117  262       58        1 Sent(0xa)
 118  399       58        1 WaitIrqStarted(true, 0x1b66e)
 119  413       58        1 WaitIrqStopped(true, 0x1b670)
 120  304       58        1 Received(0xb)
 121  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 122  480       58        1 Error(Protocol(InvalidCrc))
 123  218       58        1 DoSendRecv(0x1b679)
 124  346       58        1 RotIrqNotAsserted
 125  262       58        1 Sent(0xa)
 126  399       58        1 WaitIrqStarted(true, 0x1b679)
 127  413       58        1 WaitIrqStopped(true, 0x1b67b)
 128  304       58        1 Received(0xb)
 129  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 130  480       58        1 Error(Protocol(InvalidCrc))
 131  218       58        1 DoSendRecv(0x1b684)
 132  346       58        1 RotIrqNotAsserted
 133  262       58        1 Sent(0xa)
 134  399       58        1 WaitIrqStarted(true, 0x1b684)
 135  413       58        1 WaitIrqStopped(true, 0x1b686)
 136  304       58        1 Received(0xb)
 137  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 138  480       58        1 Error(Protocol(InvalidCrc))
 139  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
 140  596       58        1 ReqRotBootInfo
 141  218       58        1 DoSendRecv(0x1b6f4)
 142  346       58        1 RotIrqNotAsserted
 143  262       58        1 Sent(0xa)
 144  399       58        1 WaitIrqStarted(true, 0x1b6f4)
 145  413       58        1 WaitIrqStopped(true, 0x1b6f6)
 146  304       58        1 Received(0xb)
 147  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 148  480       58        1 Error(Protocol(InvalidCrc))
 149  218       58        1 DoSendRecv(0x1b6ff)
 150  346       58        1 RotIrqNotAsserted
 151  262       58        1 Sent(0xa)
 152  399       58        1 WaitIrqStarted(true, 0x1b6ff)
 153  413       58        1 WaitIrqStopped(true, 0x1b701)
 154  304       58        1 Received(0xb)
 155  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 156  480       58        1 Error(Protocol(InvalidCrc))
 157  218       58        1 DoSendRecv(0x1b70a)
 158  346       58        1 RotIrqNotAsserted
 159  262       58        1 Sent(0xa)
 160  399       58        1 WaitIrqStarted(true, 0x1b70a)
 161  413       58        1 WaitIrqStopped(true, 0x1b70c)
 162  304       58        1 Received(0xb)
 163  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 164  480       58        1 Error(Protocol(InvalidCrc))
 165  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
 166  596       58        1 ReqRotBootInfo
 167  218       58        1 DoSendRecv(0x1b77a)
 168  346       58        1 RotIrqNotAsserted
 169  262       58        1 Sent(0xa)
 170  399       58        1 WaitIrqStarted(true, 0x1b77a)
 171  413       58        1 WaitIrqStopped(true, 0x1b77c)
 172  304       58        1 Received(0xb)
 173  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 174  480       58        1 Error(Protocol(InvalidCrc))
 175  218       58        1 DoSendRecv(0x1b785)
 176  346       58        1 RotIrqNotAsserted
 177  262       58        1 Sent(0xa)
 178  399       58        1 WaitIrqStarted(true, 0x1b785)
 179  413       58        1 WaitIrqStopped(true, 0x1b787)
 180  304       58        1 Received(0xb)
 181  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 182  480       58        1 Error(Protocol(InvalidCrc))
 183  218       58        1 DoSendRecv(0x1b790)
 184  346       58        1 RotIrqNotAsserted
 185  262       58        1 Sent(0xa)
 186  399       58        1 WaitIrqStarted(true, 0x1b790)
 187  413       58        1 WaitIrqStopped(true, 0x1b792)
 188  304       58        1 Received(0xb)
 189  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 190  480       58        1 Error(Protocol(InvalidCrc))
 191  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
 192  596       58        1 ReqRotBootInfo
 193  218       58        1 DoSendRecv(0x1b800)
 194  346       58        1 RotIrqNotAsserted
 195  262       58        1 Sent(0xa)
 196  399       58        1 WaitIrqStarted(true, 0x1b800)
 197  413       58        1 WaitIrqStopped(true, 0x1b802)
 198  304       58        1 Received(0xb)
 199  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 200  480       58        1 Error(Protocol(InvalidCrc))
 201  218       58        1 DoSendRecv(0x1b80b)
 202  346       58        1 RotIrqNotAsserted
 203  262       58        1 Sent(0xa)
 204  399       58        1 WaitIrqStarted(true, 0x1b80b)
 205  413       58        1 WaitIrqStopped(true, 0x1b80d)
 206  304       58        1 Received(0xb)
 207  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 208  480       58        1 Error(Protocol(InvalidCrc))
 209  218       58        1 DoSendRecv(0x1b816)
 210  346       58        1 RotIrqNotAsserted
 211  262       58        1 Sent(0xa)
 212  399       58        1 WaitIrqStarted(true, 0x1b816)
 213  413       58        1 WaitIrqStopped(true, 0x1b818)
 214  304       58        1 Received(0xb)
 215  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 216  480       58        1 Error(Protocol(InvalidCrc))
 217  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }
 218  596       58        1 ReqRotBootInfo
 219  218       58        1 DoSendRecv(0x1b886)
 220  346       58        1 RotIrqNotAsserted
 221  262       58        1 Sent(0xa)
 222  399       58        1 WaitIrqStarted(true, 0x1b886)
 223  413       58        1 WaitIrqStopped(true, 0x1b888)
 224  304       58        1 Received(0xb)
 225  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 226  480       58        1 Error(Protocol(InvalidCrc))
 227  218       58        1 DoSendRecv(0x1b891)
 228  346       58        1 RotIrqNotAsserted
 229  262       58        1 Sent(0xa)
 230  399       58        1 WaitIrqStarted(true, 0x1b891)
 231  413       58        1 WaitIrqStopped(true, 0x1b893)
 232  304       58        1 Received(0xb)
 233  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 234  480       58        1 Error(Protocol(InvalidCrc))
 235  218       58        1 DoSendRecv(0x1b89c)
 236  346       58        1 RotIrqNotAsserted
 237  262       58        1 Sent(0xa)
 238  399       58        1 WaitIrqStarted(true, 0x1b89c)
 239  413       58        1 WaitIrqStopped(true, 0x1b89e)
 240  304       58        1 Received(0xb)
 241  470       58        1 RxBuf([ 0x3, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0 ])
 242  480       58        1 Error(Protocol(InvalidCrc))
 243  490       58        1 FailedRetries { retries: 0x3, last_errcode: Protocol(InvalidCrc) }

Here's the full RoT ringbuf:

➜  hubris git:(hubris-1507-debugging) ✗ rota ringbuf sprot
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility: ring buffer drv_lpc55_sprot_server::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
  18  356       69        1 Underrun
  19  316       69        1 SentBytes(0x0)
  20  267       69        1 ReceivedBytes(0x10)
  21   91       69        1 Err(UnsupportedProtocol)
  22  194       69        1 Stats(RotIoStats { rx_received: 0x16f, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16a, rx_invalid: 0x16f, tx_incomplete: 0x0 })
  23  277       69        1 ReplyLen(0xb)
  24  356       69        1 Underrun
  25  316       69        1 SentBytes(0x0)
  26  267       69        1 ReceivedBytes(0x10)
  27   91       69        1 Err(UnsupportedProtocol)
  28  194       69        1 Stats(RotIoStats { rx_received: 0x170, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16b, rx_invalid: 0x170, tx_incomplete: 0x0 })
  29  277       69        1 ReplyLen(0xb)
  30  356       69        1 Underrun
  31  316       69        1 SentBytes(0x0)
   0  267       70        1 ReceivedBytes(0x10)
   1   91       70        1 Err(UnsupportedProtocol)
   2  194       70        1 Stats(RotIoStats { rx_received: 0x171, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16c, rx_invalid: 0x171, tx_incomplete: 0x0 })
   3  277       70        1 ReplyLen(0xb)
   4  356       70        1 Underrun
   5  316       70        1 SentBytes(0x0)
   6  267       70        1 ReceivedBytes(0x10)
   7   91       70        1 Err(UnsupportedProtocol)
   8  194       70        1 Stats(RotIoStats { rx_received: 0x172, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16d, rx_invalid: 0x172, tx_incomplete: 0x0 })
   9  277       70        1 ReplyLen(0xb)
  10  356       70        1 Underrun
  11  316       70        1 SentBytes(0x0)
  12  267       70        1 ReceivedBytes(0x10)
  13   91       70        1 Err(UnsupportedProtocol)
  14  194       70        1 Stats(RotIoStats { rx_received: 0x173, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0x16e, rx_invalid: 0x173, tx_incomplete: 0x0 })
  15  277       70        1 ReplyLen(0xb)
  16  356       70        1 Underrun
  17  316       70        1 SentBytes(0x0)

@andrewjstone (Contributor) commented:

I also noticed, scrolling back in the faux-mgs buffer, that the first message after the timeouts from the RoT reset that starts this cycle is RoT state: Err(Sprot(BadMessageLength)).

@andrewjstone (Contributor) commented:

Looks like the RoT is coming out of reset and asserting its first ROT_IRQ at ~95.056 s on the Saleae trace. At what looks like the same time, CSn is getting de-asserted. So it seems like ROT_IRQ is firing early here for some reason. Then it ends early, presumably because it sees that CSn is already de-asserted when replying.

[Screenshots: Saleae trace, 2023-08-31 3:15:21 AM and 3:15:48 AM]

On the next status request, things look saner from the point of view of ROT_IRQ assertion, but it still ends early.
[Screenshot: Saleae trace, 2023-08-31 3:17:44 AM]

[Screenshot: Saleae trace, 2023-08-31 3:18:20 AM]

This pattern repeats indefinitely, with ROT_IRQ de-asserting early. This leads me strongly to believe that the SSD bit in the SPI status register (CSn de-asserted) is being detected in the reply method of lpc55-sprot-server even though the SP hasn't yet de-asserted CSn for this reply. This is almost certainly a mismanagement of the ssd bit: we aren't clearing it when we should. We should probably guarantee that we clear it right before we wait for the SSA (CSn asserted) interrupt.

If I had to point to one suspicious bit of code, it would be the place where we probably mishandle the SSD bit:

while !self.spi.ssd() {
    while self.spi.can_tx() {
        let entry = get_u16(idx, tx_buf);
        self.spi.send_u16(entry);
        idx += 2;
    }
}

Then, right after, we go ahead and drain our TX FIFO, which probably causes the zeroes in the CRC location to show up at the SP instead of the actual CRC:

// Prime our write fifo, so we clock out zero bytes on the next receive.
// We also empty our read fifo, since we don't bother reading bytes while writing.
self.spi.drain();
while self.spi.can_tx() {
    self.spi.send_u16(0);
}
// We don't bother receiving bytes when sending. So we must clear
// the overrun error condition for the next time we wait for a reply.
self.spi.rxerr_clear();
// Now that we are ready to handle the next request, let the SP know we
// are ready.
self.deassert_rot_irq();

@andrewjstone (Contributor) commented:

Note that while my prior comment explains what happens during the reply, it still doesn't really explain why the RoT is seeing a bunk request.

@andrewjstone (Contributor) commented Aug 31, 2023

Note that while my prior comment explains what happens during the reply, it still doesn't really explain why the RoT is seeing a bunk request.

Hmm, actually it might. We also clear the receive FIFO with the self.spi.drain(). However, the SP is still sending 0s here after we de-assert ROT_IRQ. Those end up in the receive FIFO, and therefore will come before the next received request. The RoT will then go back to waiting for the SSA interrupt, but after the few remaining 0s clocked out from the SP, the SP will de-assert CSn, causing the SSD bit on the RoT to get set. This happens after we clear SSD in reply on the RoT.

Then the next request comes in from the SP. We already have some leading 0s at the RoT, so we know it will decode as UnsupportedProtocol. The RoT gets the SSA interrupt, goes into a tight loop, sees that SSD is already set, and reads a subset of the bytes from the SP. The RoT then goes into reply() and starts putting its reply into its TX FIFO. The RoT asserts ROT_IRQ and waits for the SSA interrupt again. The SP then finishes sending its request and de-asserts CSn. When the SSA interrupt fires, SSD will again already be set, and the RoT will return a short response. This will happen perpetually, because of the mismanagement of the SSD bit.

TODO: Figure out how to not misuse the SSD bit on the RoT!
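
For concreteness, here is a minimal sketch of the "clear SSD just before waiting for the SSA interrupt" guard floated a few comments up; ssd_clear() and wait_for_ssa_interrupt() are placeholder names for this illustration, not the actual lpc55-sprot-server API:

// Sketch only: ensure a stale `ssd` left over from the previous transaction
// can't terminate the next reply early. `ssd_clear()` and
// `wait_for_ssa_interrupt()` are assumed helpers, not the real driver API.
self.spi.ssd_clear();           // discard any end-of-transfer state from the past
self.wait_for_ssa_interrupt();  // block until CSn asserts for *this* request
// Only from this point on is `self.spi.ssd()` meaningful for the current
// transaction's transmit loop shown earlier in the thread.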

@labbott (Collaborator) commented Aug 31, 2023

Awesome analysis! It sounds like this still matches with our understanding of how the SSA/SSD bits work? Those bits aren't used at all in NXP's reference driver so I have this nagging fear that they are undertested and wrongly documented.

@andrewjstone (Contributor) commented:

Awesome analysis! It sounds like this still matches with our understanding of how the SSA/SSD bits work? Those bits aren't used at all in NXP's reference driver so I have this nagging fear that they are undertested and wrongly documented.

Thanks for taking a look @labbott! This matches my understanding of how the SSA/SSD bits work. Note that we do read them from different registers though. The SSA read is from the intstat register, which is destructive, while the SSD bit is from the status register which is non-destructive. This was a change I made while refactoring many months ago, but I don't think it's the cause of the issue or anything. AFAICT the bits act as we understand and as documented. I don't think we can necessarily blame NXP for anything this time 😆

@andrewjstone (Contributor) commented:

Zooming in on the first trace shows that, right after a timeout, SSD fires immediately after ROT_IRQ and sets us on this disastrous path.
[Screenshot: Saleae trace, 2023-09-01 12:35:27 PM]

@andrewjstone (Contributor) commented:

The above is presumably because, as the RoT comes out of reset, SSA and SSD are already set, so it reads what's in its FIFO and goes to reply. Then the request from the SP actually ends.

@lzrd (Contributor) commented Sep 6, 2023

Recommendation from our discussion: on start, the lpc55 sprot_server asserts ROT_IRQ until it has finished initializing FIFOs and SSA/SSD state. This does not fully cover the case where the RoT has been reset and the SP steps on FIFO initialization, but it does cover some of that window and makes it much easier to find these failures on a logic analyzer.

@lzrd (Contributor) commented Sep 6, 2023

Recommendation from Matt: after the hot-path receive, the RoT checks whether CSn is asserted. If it is, we have a synchronization error: the RoT waits for CSn to be de-asserted, cleans up, queues a SYNC_ERROR message, then asserts ROT_IRQ.

@andrewjstone (Contributor) commented Sep 7, 2023

I wrote a script that can sometimes reproduce the issue. I'll likely continue tweaking it:

#!/usr/bin/env bash


MGS_REPO=${MGS_REPO:=~/oxide/management-gateway-service}
HUBRIS_REPO=${HUBRIS_REPO:=~/oxide/hubris}
HUMILITY_ENVIRONMENT=${HUMILITY_ENVIRONMENT:=~/Users/ajs/oxide/humility_env.json}

HENV=$HUMILITY_ENVIRONMENT
rota_reset='humility -t rot-carrier --archive-name a reset'
ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}

# Number of times in a row `faux-mgs state` must fail to be a test success
MAX_FAILS=${MAX_FAILS:=100}


ready_to_reset=false
failures_in_a_row=0

# We want to wait for a few times through the loop to make sure the reset
# command completes before trying again. Otherwise, we will get an error:
#
#   humility reset failed: Probe could not be created
#
#  Caused by:
#      0: hidapi error: hid_error is not implemented yet
#  	1: hidapi error: hid_error is not implemented yet
#
MAX_GRACE_CYCLES=20
reset_grace=$MAX_GRACE_CYCLES

while true; do
  # Run our command
  output=`$MGS_REPO/target/release/faux-mgs --interface en6 --log-level critical state`
  echo "$output" | grep -q "RoT state: Ok"
  if [[ $? -eq 0 ]]; then
    failures_in_a_row=0
    echo "."
  else
    failures_in_a_row=$((failures_in_a_row+1))
    echo "+"
  fi

  # increment our grace period counter
  reset_grace=$((reset_grace+1))

  # Are we done?
  if [[ failures_in_a_row -eq MAX_FAILS ]]; then
    echo "Test Complete: found faulty sprot"
    break
  fi

  # Should we reset the RoT to try again?
  choice=$RANDOM
  sleep=$((1/(choice % 100 + 1)))
  if [[ $failures_in_a_row -eq 0 && $reset_grace -gt $MAX_GRACE_CYCLES ]]; then
    # reset our grace period counter
    reset_grace=0
    # Flip a coin to see if we should sleep before we try to reset or after
    # We right shift to create a boolean
    let "choice >>= 14"
    if [[ $choice -eq 0 ]]; then
      HUMILITY_ENV=$HENV bash -c "sleep ${sleep} && ${ROT_RESET_COMMAND} &" &
    else 
      HUMILITY_ENV=$HENV bash -c "${ROT_RESET_COMMAND} &" &
    fi 
  else
    sleep 0.01
  fi
done

@andrewjstone (Contributor) commented Sep 7, 2023

Further validation from a local reproduction: the CSn pin remains asserted when we read it directly after a reply completes. I'm doing this manually after the reproduction, while we are stuck in this state, and can see CSn continually asserted. This also means that we can use this pin (and @mkeeter's great suggestion) to detect the stuck state, wait for the actual CSn line to become de-asserted, and then return a SYNC_ERROR message.

humility: ring buffer drv_lpc55_sprot_server::__RINGBUF in sprot:
 NDX LINE      GEN    COUNT PAYLOAD
  23  272       43        1 ReceivedBytes(0x10)
  24   91       43        1 Err(UnsupportedProtocol)
  25  195       43        1 Stats(RotIoStats { rx_received: 0xab, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xaa, rx_invalid: 0xab, tx_incomplete: 0x0 })
  26  282       43        1 ReplyLen(0xb)
  27  207       43        1 Pins(One, One)
  28  361       43        1 Underrun
  29  321       43        1 SentBytes(0x0)
  30  207       43        1 Pins(Zero, One)
  31  272       43        1 ReceivedBytes(0x10)
   0   91       44        1 Err(UnsupportedProtocol)
   1  195       44        1 Stats(RotIoStats { rx_received: 0xac, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xab, rx_invalid: 0xac, tx_incomplete: 0x0 })
   2  282       44        1 ReplyLen(0xb)
   3  207       44        1 Pins(One, One)
   4  361       44        1 Underrun
   5  321       44        1 SentBytes(0x0)
   6  207       44        1 Pins(Zero, One)
   7  272       44        1 ReceivedBytes(0x10)
   8   91       44        1 Err(UnsupportedProtocol)
   9  195       44        1 Stats(RotIoStats { rx_received: 0xad, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xac, rx_invalid: 0xad, tx_incomplete: 0x0 })
  10  282       44        1 ReplyLen(0xb)
  11  207       44        1 Pins(One, One)
  12  361       44        1 Underrun
  13  321       44        1 SentBytes(0x0)
  14  207       44        1 Pins(Zero, One)
  15  272       44        1 ReceivedBytes(0x10)
  16   91       44        1 Err(UnsupportedProtocol)
  17  195       44        1 Stats(RotIoStats { rx_received: 0xae, rx_overrun: 0x1, csn_pulses: 0x0, tx_underrun: 0xad, rx_invalid: 0xae, tx_incomplete: 0x0 })
  18  282       44        1 ReplyLen(0xb)
  19  207       44        1 Pins(One, One)
  20  361       44        1 Underrun
  21  321       44        1 SentBytes(0x0)
  22  207       44        1 Pins(Zero, One)

And here's a screenshot of the early ROT_IRQ de-assertion due to desync from the Saleae trace for this run. This happens indefinitely.
[Screenshot: Saleae trace, 2023-09-07 2:19:03 AM]

@lzrd (Contributor) commented Sep 7, 2023

Adapted Andrew's script for my environment:

#!/usr/bin/env bash

set -e
START=$(date +%s)
MAX_GRACE_CYCLES=20
# Number of times in a row `faux-mgs state` must fail to be a test success
MAX_FAILS=${MAX_FAILS:=100}
#MAX_FAILS=${MAX_FAILS:=5}

LOG_LEVEL="--log-level critical"

case $USER in
  ajs)
    HUBRIS_REPO=${HUBRIS_REPO:=~/oxide/hubris}
    export HUMILITY_ENVIRONMENT=${HUMILITY_ENVIRONMENT:=~/Users/ajs/oxide/humility_env.json}
    rota_reset='humility -t rot-carrier --archive-name a reset'
    INTERFACE=en6
    MGS_REPO=${MGS_REPO:=~/oxide/management-gateway-service}
    FAUX_MGS="${MGS_REPO}/target/release/faux-mgs --interface ${INTERFACE} ${LOG_LEVEL}"
    ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}
    ;;
  stoltz)
    FAUX_MGS="$(which faux-mgs) ${LOG_LEVEL}"

    HUBRIS_REPO=$HOME/Oxide/src/hubris
    export HUMILITY_ARCHIVE="$HUBRIS_REPO/target/rot-carrier/dist/a/build-rot-carrier-image-a.zip"
    export HUMILITY_PROBE="1fc9:0143:MMLBHACB24IDJ"
    rota_reset="humility reset"
    ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}
    ;;
  *)
    echo 'Where am I?'
    exit 1
    ;;
esac

ready_to_reset=false
failures_in_a_row=0

# We want to wait for a few times through the loop to make sure the reset
# command completes before trying again. Otherwise, we will get an error:
#
#   humility reset failed: Probe could not be created
#
#  Caused by:
#      0: hidapi error: hid_error is not implemented yet
#  1: hidapi error: hid_error is not implemented yet
#
reset_grace=$MAX_GRACE_CYCLES

while true; do
  # Run our command
  if (${FAUX_MGS} state |& grep -q "RoT state: Ok" )
  then
    failures_in_a_row=0
    echo -n .
  else
    failures_in_a_row=$((failures_in_a_row+1))
    echo -n +$failures_in_a_row
  fi

  # increment our grace period counter
  reset_grace=$((reset_grace+1))

  # Are we done?
  if [[ failures_in_a_row -eq MAX_FAILS ]]; then
    echo "Test Complete: found faulty sprot after $(( $(date +%2) - START )) seconds."
    break
  fi

  # Should we reset the RoT to try again?
  choice=$RANDOM
  sleep=$((1/(choice % 100 + 1)))
  if [[ $failures_in_a_row -eq 0 && $reset_grace -gt $MAX_GRACE_CYCLES ]]; then
    # reset our grace period counter
    reset_grace=0
    # Flip a coin to see if we should sleep before we try to reset or after
    # We right shift to create a boolean
    choice=$((choice >> 14))
    if [[ $choice -eq 0 ]]; then
      bash -c " sleep ${sleep} && ${ROT_RESET_COMMAND} &"
    else
      bash -c "( ${ROT_RESET_COMMAND} )&"
    fi
  else
    sleep 0.01
  fi
done

andrewjstone added a commit that referenced this issue Sep 8, 2023
The primary fix here is for the RoT to detect desynchronization between
the SP and RoT by seeing if `CSn` is still asserted *after* the RoT sees
the `ssd` bit set in the fifo status register and then to wait until
CSn is actually de-asserted before either replying to the SP or de-
asserting ROT_IRQ. Detection is performed by polling the `CHIP_SELECT`
gpio pin that corresponds to the `CSn` pin used by the SPI block on
the RoT.

The key problem being solved is that the `ssd` bit is saturating, and
so the RoT sprot server may look at this bit and think that the current
SPI transaction is done, but the bit is only set from a prior request.
The actual `CSn` de-assert arrives after the RoT reads and clears the
`ssd` bit, causing the bit to be set again. The next request from the
SP then comes in, the `ssa` interrupt indicating `CSn` asserted fires
and the RoT goes into a tight loop, processes a fifo's worth of data
and again immediately sees the `ssd` bit set and ends its response
early. In our traces we saw that this exposes itself by `ROT_IRQ` de-
asserting before `CSn` is actually de-asserted and sprot continues this
livelock indefinitely. If after every transaction, the RoT waits for the
`CSn` gpio line to actually become de-asserted before we consider the
SPI transaction complete, we know that we are operating on a request or
reply boundary, and thus the RoT and SP are resynchronized.

Our Saleae traces from #1507 also showed that we get into this scenario
in the first place by having the RoT start reading from its fifo in
the middle of a transaction. We therefore added support for checking
the `sot` bit on the first read from the fifo to see if this is indeed
the first fifo entry being read after `CSn` was asserted. If this is
not the case the RoT immediately declares a desynchronization, waits
for `CSn` to actually be de-asserted via a gpio read, replies to the
SP with the desynchronization error for visibility and moves on to
waiting for the next request. This strategy was discussed in chat with
hubris team members and causes no harm. However, after implementing and
thinking more about this it seems semantically incorrect. We already
have `SprotProtocolError::FlowError` which indicates overruns. A missing
`sot` bit on the first fifo read of a request is actually just a flow
control error, except that instead of the missing bytes being in the
middle of the request, they are at the beginning. In the common case,
this should be detected via the `rxerror` bit, and we should return a
`FlowError`. If there is an actual desynchronization, we will detect
that after the request when we poll `CSn`. It is totally possible that
the RoT misses the first few bytes of an SP request but is not looking
at an `ssd` bit from a prior transaction. Informing the `SP` that
this flow error is a very rare desynchronization event that is triggered
on sled restarts and bumping counters will lead to misleading debugging
paths IMO, and we should probably remove the code that uses the
`sot` bit before this is merged. You can see that I did not bump the
counter in this semantically different case even though we do return
`Desynchronized` to the SP in the main loop when `wait_for_request` hits
this case and returns `IoError::Desynchronized`.

There were some additional changes made for clarity and correctness.
Cleanup of fifos, errors, and ssa/ssd bits is now self-contained, and
asserting and de-asserting ROT_IRQ happen inside `reply`. I didn't think
it was really necessary to optimize for the `sys_irq_control` syscall
delay with regards to setting `ROT_IRQ` given that we have a 16 byte
fifo and then the SP pauses for 2 ms before reading the rest of a reply
larger than 16 bytes. That gives plenty of time for that syscall to
complete and for the RoT to wait for the CSn asserted IRQ wakeup after
asserting ROT_IRQ. This change makes the code more linear and removes
some unnecessary state. Testing so far has shown no difference in error
rate.
@andrewjstone andrewjstone mentioned this issue Sep 8, 2023
andrewjstone added a commit that referenced this issue Sep 8, 2023
The primary fix here is for the RoT to detect desynchronization between
the SP and RoT and then wait until synchronization resumes before
proceeding. The RoT and SP are desynchronized if `CSn` is still asserted
*after* the RoT sees the `ssd` bit set in the fifo status register
during its tight loops while clocking in a request or clocking out a
reply to the SP. Synchronization is resumed by the RoT waiting until `CSn`
is actually de-asserted before either replying to the SP or de-asserting
ROT_IRQ. Detection of actual `CSn` state is performed by polling the
`CHIP_SELECT` gpio pin that corresponds to the `CSn` pin used by the SPI
block on the RoT that sets the `ssa` and `ssd` bits.

The key problem being solved is that the `ssd` bit is saturating, and so
the RoT sprot server may look at this bit and think that the current SPI
transaction is done, even though the bit was set during a prior request.
The actual `CSn` de-assert for the current request arrives after the
RoT reads and clears the `ssd` bit, causing the bit to be set again. The
next request from the SP then comes in, the `ssa` interrupt indicating
`CSn` asserted fires and the RoT goes into a tight loop, processes a
fifo's worth of data and again immediately sees the `ssd` bit set and
ends its response early. In our traces this showed up as `ROT_IRQ`
de-asserting before `CSn` was actually de-asserted, with sprot continuing
this livelock indefinitely. If, after every transaction, the RoT waits for
the `CSn` gpio line to actually become de-asserted before considering the
SPI transaction complete, we know that we are operating on a request or
reply boundary, and thus the RoT and SP are resynchronized.
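
As a minimal sketch of that resynchronization check (the `Sys` type,
`CHIP_SELECT_PIN`, and the helper names below are placeholders for this
illustration, not the actual drv-lpc55-sprot-server API):

// Sketch only: stand-ins for the real gpio/sys driver interfaces.
struct Sys;
const CHIP_SELECT_PIN: u8 = 0; // hypothetical pin identifier

impl Sys {
    fn gpio_read(&self, _pin: u8) -> u8 {
        // In real firmware this reads the CHIP_SELECT gpio level.
        1
    }
}

// CSn is active-low: a low CHIP_SELECT level means the SP is still
// driving the current transaction.
fn csn_is_asserted(sys: &Sys) -> bool {
    sys.gpio_read(CHIP_SELECT_PIN) == 0
}

// Called once the `ssd` bit has been observed. Because `ssd` saturates and
// may still reflect a previous transaction, only the gpio level tells us
// that the SP has really de-asserted `CSn`; waiting here pins the RoT to a
// request/reply boundary before it replies or drops ROT_IRQ.
fn wait_for_csn_deasserted(sys: &Sys) {
    while csn_is_asserted(sys) {
        // busy-wait (or sleep a tick) until the SP releases the line
    }
}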

Our Saleae traces from #1507 also showed that we get into this scenario
in the first place by having the RoT start reading from its fifo in
the middle of a transaction. We therefore added support for checking
the `sot` bit on the first read from the fifo to see if this is indeed
the first fifo entry being read after `CSn` was asserted. If this is
not the case, the RoT immediately declares a desynchronization, waits
for `CSn` to actually be de-asserted via a gpio read, replies to the
SP with the desynchronization error for visibility, and moves on to
waiting for the next request. This strategy was discussed in chat with
hubris team members and causes no harm. However, after implementing it
and thinking about it more, it seems semantically incorrect. We
already have `SprotProtocolError::FlowError` which indicates overruns.
A missing `sot` bit on the first fifo read of a request is actually
just a flow control error, except that instead of the missing bytes
being in the middle of the request, they are at the beginning. In the
common case, this should be detected via the `rxerror` bit, and we
should return a `FlowError`. If there is an actual desynchronization,
we will detect that after the request when we poll the `CHIP_SELECT`
gpio pin. It is totally possible that the RoT misses the first few
bytes of an SP request but is not looking at an `ssd` bit from a prior
transaction. Informing the SP that this common flow error is a very rare
desynchronization event triggered by sled restarts, and bumping counters
for it, will lead to misleading debugging paths IMO, and we should
probably remove the code that uses the `sot` bit before this is merged.
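
For concreteness, the classification argued for above could look roughly
like this (the `FifoStatus` struct and the helper are invented for this
sketch; only the `FlowError`-vs-`Desynchronized` distinction mirrors the
discussion):

// Sketch only: a trimmed-down model of the first fifo read.
enum SprotProtocolError {
    FlowError,      // bytes were dropped somewhere in the request
    Desynchronized, // RoT and SP disagree about message boundaries
}

struct FifoStatus {
    sot: bool,     // set on the first entry clocked in after `CSn` assert
    rxerror: bool, // receiver overrun
}

fn classify_first_fifo_read(status: &FifoStatus) -> Option<SprotProtocolError> {
    if !status.sot || status.rxerror {
        // Missing or dropped leading bytes are just a flow-control problem;
        // a true desynchronization is caught after the request by polling
        // the CHIP_SELECT gpio, so we do not report `Desynchronized` here.
        Some(SprotProtocolError::FlowError)
    } else {
        None
    }
}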

There were some additional changes made for clarity and correctness.
Cleanup of fifos, errors, and ssa/ssd bits is now self-contained, and
asserting and de-asserting ROT_IRQ happen inside `reply`. I didn't think
it was really necessary to optimize for the `sys_irq_control` syscall
delay with regard to setting `ROT_IRQ`, given that we have a 16-byte
fifo and the SP pauses for 2 ms before reading the rest of a reply
larger than 16 bytes. That gives plenty of time for that syscall to
complete and for the RoT to wait for the CSn asserted IRQ wakeup after
asserting ROT_IRQ. This change makes the code more linear and removes
some unnecessary state. Testing so far has shown no difference in error
rate.
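
The resulting shape of the reply path, roughly (again, `Sys` and `Spi` here
are placeholder types; only the ordering of the steps reflects the change
described above):

// Sketch only: placeholder driver handles.
struct Sys;
struct Spi;

impl Sys {
    fn assert_rot_irq(&self) {}
    fn deassert_rot_irq(&self) {}
    fn csn_is_asserted(&self) -> bool { false }
}

impl Spi {
    fn clock_out(&mut self, _rsp: &[u8]) {}
}

fn reply(sys: &Sys, spi: &mut Spi, rsp: &[u8]) {
    // Tell the SP a reply is ready. The 16-byte fifo plus the SP's 2 ms
    // pause for replies longer than 16 bytes leaves plenty of slack for the
    // sys_irq_control syscall, so no special optimization is needed here.
    sys.assert_rot_irq();

    // The SP clocks the reply out of the fifo.
    spi.clock_out(rsp);

    // Do not trust `ssd` alone: wait for `CSn` to really de-assert before
    // dropping ROT_IRQ, so every exchange ends on a transaction boundary.
    while sys.csn_is_asserted() {}
    sys.deassert_rot_irq();
}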

On the SP side, `Desynchronization` errors are now recoverable.
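
Concretely, "recoverable" just means the SP-side retry loop treats it like
the other transient protocol errors (the helper below is invented for this
sketch; the variant names follow the ones discussed in this issue):

// Sketch only: a trimmed-down stand-in for the SP's error classification.
enum SprotProtocolError {
    InvalidCrc,
    FlowError,
    Deserialization,
    Desynchronized,
}

fn is_recoverable(err: &SprotProtocolError) -> bool {
    // After this change a reported desynchronization is retried just like a
    // CRC, flow, or deserialization error rather than being treated as fatal.
    matches!(
        err,
        SprotProtocolError::InvalidCrc
            | SprotProtocolError::FlowError
            | SprotProtocolError::Deserialization
            | SprotProtocolError::Desynchronized
    )
}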

It should also be noted how adding the new `Desynchronization` error
will affect upgrades. It is a new variant of the `SprotProtocolError`
enum, and therefore code that is unaware of this variant will not be
able to deserialize it. Specifically (see the sketch after this list):

 1. Because the RoT is upgraded before the SP during mupdate, the SP code
will see a `SprotProtocolError::Deserialization` error in
the case of desynchronization. This is already a recoverable error and
the behavior of the SP sprot server should be the same, except that if
retries are exceeded for some reason, the wrong error could be plumbed
up to the control-plane-agent and MGS. This is exceedingly unlikely for
this specific error, except for perhaps in the flow control case where
we use `sot` described above.

 2. Until MGS is updated, if the new error variant gets plumbed upwards
it will be seen as an incompatible protocol error. This is not really
a big deal in this case, as we are still mupdating and this is the
only related error that can occur this way until the system is further
upgraded.
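
The compatibility concern is the usual one for serialized enums; a
hypothetical decoder (not the real sprot wire format) makes the failure
mode obvious:

// Hypothetical illustration only: old SP/MGS code that predates the new
// variant has no arm for its discriminant and can only report a
// deserialization failure.
enum OldSprotProtocolError {
    InvalidCrc,
    FlowError,
    Deserialization,
    // ...no Desynchronized variant yet
}

fn decode_old(discriminant: u8) -> Result<OldSprotProtocolError, &'static str> {
    match discriminant {
        0 => Ok(OldSprotProtocolError::InvalidCrc),
        1 => Ok(OldSprotProtocolError::FlowError),
        2 => Ok(OldSprotProtocolError::Deserialization),
        // A newer RoT reporting the new variant lands here, so the old
        // code surfaces it as a deserialization/incompatibility error.
        _ => Err("unknown SprotProtocolError variant"),
    }
}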