SP-RoT link ends up in a persistent bad state #1507
Possibly related, although a different error message: #1451
Yeah, thanks for linking that! I'm not sure if they're related; a timeout occurs if the RoT never asserts ROT_IRQ. On the other hand, if we're blaming FIFOs for framing issues, we could hypothesize ...

(This is extremely hand-wavy / unjustified; e.g. I can't remember if the RoT replies with an error on a corrupt message, but just putting it out there.)
Re "This looks like a framing issue where the SP and RoT have gotten out of sync. ..." SPI messages are framed by the SPI bus CSn signal.
Note: When the SP does a flush (SPI CSn deassert, 10ms delay with no SPI clocks, CSn assert), the SP and RoT communications should be re-synchronized if they ever fell out of sync. ajs and I don't see anything wrong there yet.
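A toy model may help make the flush boundary concrete. This is purely illustrative (the `Framer` type and its methods are invented for this sketch, not hubris driver code): CSn edges delimit frames, so a deassert/assert cycle realigns both sides regardless of how many stray bytes were clocked beforehand.

```rust
// Toy model of SPI framing by CSn edges. Invented names; not the hubris API.
#[derive(Default)]
pub struct Framer {
    buf: Vec<u8>,
}

impl Framer {
    /// A byte clocked in while CSn is asserted.
    pub fn byte(&mut self, b: u8) {
        self.buf.push(b);
    }

    /// CSn deassert: whatever accumulated is one (possibly garbage) frame,
    /// and the next frame starts from an empty buffer.
    pub fn csn_deassert(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.buf)
    }
}

fn main() {
    let mut f = Framer::default();
    // Stray bytes from a half-seen transaction...
    f.byte(0xAA);
    f.byte(0xBB);
    // ...are bounded by the flush's deassert edge,
    let garbage = f.csn_deassert();
    assert_eq!(garbage, vec![0xAA, 0xBB]);
    // and the next frame starts cleanly aligned.
    f.byte(0x03);
    f.byte(0x00);
    assert_eq!(f.csn_deassert(), vec![0x03, 0x00]);
}
```

The model captures why the 10ms quiet period matters: both sides observe the same CSn edge, so any previous misalignment is discarded at that boundary.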
Is there any chance that the LPC55S69 parts are programmed to operate at 48MHz instead of the required 96MHz?
I did some SysTick (1ms) resolution ringbuf logging that doesn't immediately shed any light.
I was unable to reproduce on madrid on Tuesday; however, today I reproduced a single time on my gimletlet and rot-carrier. Reproduction here means that we get an endless stream of ...

Reproduced on the gimletlet/rot-carrier with Keeter's recommendation ... Spam ...

Specifically, this command was run on my helios box: ...

Then on my mac: ...

Here's the relevant RoT ringbuf: ...

What's odd is that, consistently, ... It's like the RoT is stuck and not actually mutating its buffers.

Here's the SP state for the last retry. The ...

The first 6 bytes indicate an error response for ...

I hooked up the salae and I can see that the RoT is raising ROT_IRQ before it ... What could explain both these things, early ROT_IRQ raise and invalid read, is ...

We also see that the RoT is returning 0 bytes on reply and detecting an underrun. It's ...

Going to add some more debug traces to the RoT side and try to reproduce again.

Having a hard time reproducing again with my added ringbuf traces on the RoT. So ...

We know from the RoT ringbuf that the RoT receives 16 bytes while ... We also know that the RoT, when trying to reply with this error, had not clocked ... We then notice in the RoT ...

So maybe this is purely just a timing rhythm that kicks off when the RoT starts ... This is plausible because we see an overrun in the RoT logs indicating a short ...

It seems that this is all due to the retry timeout being 5ms and us hitting ...

Now, to help clarify, I'd like to get a reproduction with a salae trace. I'm ...
Ok, reproduced again. Interestingly, a closer look shows that there are a few ...

Here's the full gimletlet ringbuf: ...

Here's the full RoT ringbuf: ...
I also noticed, scrolling back in the faux-mgs buffer, that the first message after the timeouts from the RoT reset that starts this cycle is ...
Looks like the RoT is coming out of reset and issuing its first ... ROT_IRQ is asserted at ~95.056 s on the Salae trace. At what looks like the same time, CSn is getting de-asserted. So it seems like ROT_IRQ is firing early here for some reason. Then it ends early, presumably because it sees CSn deasserted already when replying.

On the next status request things look saner from the ROT_IRQ-assertion point of view, but it still ends early.

This pattern repeats indefinitely, with ROT_IRQ de-asserting early. This leads me strongly to believe that the SSD bit in the SPI status register (CSn de-asserted) is being detected in the ...

If I had to point to one suspicious bit in the code, it would be the place where we probably do the mishandling of the SSD bit:

hubris/drv/lpc55-sprot-server/src/main.rs Lines 294 to 300 in cd5f3c4
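To make the suspected race concrete, here is a minimal model of a latched ("saturating") SSD bit. All names here (`SpiStat`, `hw_csn_deassert`, `sw_read_and_clear`) are invented for this sketch and are not the LPC55 register API; the point is only the ordering hazard: clearing SSD before the prior transaction's deassert edge lands means the next transaction observes a stale SSD immediately.

```rust
// Minimal model of a latched SSD bit: hardware sets it on every CSn
// deassert edge; software must clear it. Invented names for illustration.
#[derive(Default)]
struct SpiStat {
    ssd: bool, // set by hw on CSn deassert, cleared by sw read
}

impl SpiStat {
    fn hw_csn_deassert(&mut self) {
        self.ssd = true; // hardware latches the event
    }
    fn sw_read_and_clear(&mut self) -> bool {
        std::mem::take(&mut self.ssd)
    }
}

fn main() {
    let mut stat = SpiStat::default();

    // Transaction N: the RoT reads/clears SSD slightly too early...
    let _ = stat.sw_read_and_clear();
    // ...then the real deassert edge for transaction N lands:
    stat.hw_csn_deassert();

    // Transaction N+1 begins (CSn asserted, bytes flowing). The RoT polls
    // SSD, sees it already set -- left over from transaction N -- and ends
    // its side of the transaction early, before CSn actually deasserts.
    let stale = stat.sw_read_and_clear();
    assert!(stale, "RoT wrongly concludes transaction N+1 is over");
}
```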
Then right after, we go ahead and drain our TX fifo, which probably triggers the zeroes in the CRC location to show up at the SP instead of the actual CRC:

hubris/drv/lpc55-sprot-server/src/main.rs Lines 318 to 330 in cd5f3c4
Note that while my prior comment explains what happens during the reply, it still doesn't really explain why the RoT is seeing a bunk request.
Hmm, actually it might. We also clear the receive fifo with the ...

TODO: Figure out how to not misuse the SSD bit on the RoT!
Awesome analysis! It sounds like this still matches with our understanding of how the SSA/SSD bits work? Those bits aren't used at all in NXP's reference driver, so I have this nagging fear that they are undertested and wrongly documented.
Thanks for taking a look @labbott! This matches my understanding of how the SSA/SSD bits work. Note that we do read them from different registers, though. The SSA read is from the intstat register, which is destructive, while the SSD bit is from the status register, which is non-destructive. This was a change I made while refactoring many months ago, but I don't think it's the cause of the issue or anything. AFAICT the bits act as we understand and as documented. I don't think we can necessarily blame NXP for anything this time 😆
The above is presumably because as the RoT comes out of reset, SSA and SSD are already set, so it reads what's in its fifo and goes to reply. Then the request from the SP actually ends.
Recommendation from our discussion: on start, the lpc55 sprot_server asserts ROT_IRQ until it has finished initializing FIFOs and SSA/SSD state. This does not fully cover the case where the RoT has been reset and the SP steps on FIFO initialization, but it does cover some of that window and makes it much easier to find these failures on a logic analyzer.
Recommendation from Matt: after the hot path receive, the RoT checks for CSn asserted. If asserted, we have a synchronization error. The RoT waits for CSn deasserted, then cleans up, queues a SYNC_ERROR message, then asserts ROT_IRQ.
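A sketch of what this recommendation might look like, with invented names (`finish_receive`, `RxOutcome`); the gpio read of the CHIP_SELECT pin is stood in for by a closure so the logic can be exercised standalone:

```rust
// Sketch of the post-receive CSn check. Invented names; the real server
// would read the CHIP_SELECT gpio pin and then queue a SYNC_ERROR reply.
#[derive(Debug, PartialEq)]
enum RxOutcome {
    Ok,
    Desynchronized,
}

/// `csn_asserted` stands in for a gpio read of the CHIP_SELECT pin.
/// Called after the hot-path receive reports SSD.
fn finish_receive(mut csn_asserted: impl FnMut() -> bool) -> RxOutcome {
    if csn_asserted() {
        // The SSD bit was stale: the current transaction is still in
        // flight. Wait for the *real* boundary before touching the fifos.
        while csn_asserted() {}
        RxOutcome::Desynchronized
    } else {
        RxOutcome::Ok
    }
}

fn main() {
    // CSn still asserted for a few polls, then deasserts: desync detected.
    let mut polls = 0;
    let out = finish_receive(|| {
        polls += 1;
        polls <= 3
    });
    assert_eq!(out, RxOutcome::Desynchronized);

    // CSn already deasserted when SSD was observed: in sync.
    assert_eq!(finish_receive(|| false), RxOutcome::Ok);
}
```

The key property is that `finish_receive` only returns once CSn is genuinely deasserted, so by the time the server cleans up and asserts ROT_IRQ, both sides are at a transaction boundary.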
I wrote a script that can sometimes reproduce. I'll likely continue tweaking: #!/usr/bin/env bash
MGS_REPO=${MGS_REPO:=~/oxide/management-gateway-service}
HUBRIS_REPO=${HUBRIS_REPO:=~/oxide/hubris}
HUMILITY_ENVIRONMENT=${HUMILITY_ENVIRONMENT:=~/oxide/humility_env.json}
HENV=$HUMILITY_ENVIRONMENT
rota_reset='humility -t rot-carrier --archive-name a reset'
ROT_RESET_COMMAND=${ROT_RESET_COMMAND:="$rota_reset"}
# Number of times in a row `faux-mgs state` must fail to be a test success
MAX_FAILS=${MAX_FAILS:=100}
ready_to_reset=false
failures_in_a_row=0
# We want to wait for a few times through the loop to make sure the reset
# command completes before trying again. Otherwise, we will get an error:
#
# humility reset failed: Probe could not be created
#
# Caused by:
# 0: hidapi error: hid_error is not implemented yet
# 1: hidapi error: hid_error is not implemented yet
#
MAX_GRACE_CYCLES=20
reset_grace=$MAX_GRACE_CYCLES
while true; do
# Run our command
output=`$MGS_REPO/target/release/faux-mgs --interface en6 --log-level critical state`
echo "$output" | grep -q "RoT state: Ok"
if [[ $? -eq 0 ]]; then
failures_in_a_row=0
echo "."
else
failures_in_a_row=$((failures_in_a_row+1))
echo "+"
fi
# increment our grace period counter
reset_grace=$((reset_grace+1))
# Are we done?
if [[ $failures_in_a_row -eq $MAX_FAILS ]]; then
echo "Test Complete: found faulty sprot"
break
fi
# Should we reset the RoT to try again?
choice=$RANDOM
# bash arithmetic is integer-only; use awk so the sleep can be fractional
sleep=$(awk -v c=$(((choice % 100) + 1)) 'BEGIN { printf "%.3f", 1 / c }')
if [[ $failures_in_a_row -eq 0 && $reset_grace -gt $MAX_GRACE_CYCLES ]]; then
# reset our grace period counter
reset_grace=0
# Flip a coin to see if we should sleep before we try to reset or after
# We right shift to create a boolean
let "choice >>= 14"
if [[ $choice -eq 0 ]]; then
HUMILITY_ENVIRONMENT=$HENV bash -c "sleep ${sleep} && ${ROT_RESET_COMMAND} &" &
else
HUMILITY_ENVIRONMENT=$HENV bash -c "${ROT_RESET_COMMAND} &" &
fi
else
sleep 0.01
fi
done
Further validation from a local reproduction that the ...
And here's a screenshot of the early ROT_IRQ de-assertion due to desync from the salae trace for this run. This happens indefinitely.
Adapted Andrew's script for my environment:
The primary fix here is for the RoT to detect desynchronization between the SP and RoT and then wait until synchronization resumes before proceeding. The RoT and SP are desynchronized if `CSn` is still asserted *after* the RoT sees the `ssd` bit set in the fifo status register during its tight loops while clocking in a request or clocking out a reply to the SP. Synchronization is resumed by the RoT waiting until `CSn` is actually de-asserted before either replying to the SP or de-asserting ROT_IRQ. Detection of the actual `CSn` state is performed by polling the `CHIP_SELECT` gpio pin that corresponds to the `CSn` pin used by the SPI block on the RoT, which sets the `ssa` and `ssd` bits.

The key problem being solved is that the `ssd` bit is saturating, and so the RoT sprot server may look at this bit and think that the current SPI transaction is done, even though the bit was set during a prior request. The actual `CSn` de-assert for the current request arrives after the RoT reads and clears the `ssd` bit, causing the bit to be set again. The next request from the SP then comes in, the `ssa` interrupt indicating `CSn` asserted fires, and the RoT goes into a tight loop, processes a fifo's worth of data, again immediately sees the `ssd` bit set, and ends its response early. In our traces we saw that this exposes itself as `ROT_IRQ` de-asserting before `CSn` is actually de-asserted, and sprot continues this livelock indefinitely. If, after every transaction, the RoT waits for the `CSn` gpio line to actually become de-asserted before we consider the SPI transaction complete, we know that we are operating on a request or reply boundary, and thus the RoT and SP are resynchronized.

Our salae traces from #1507 also showed that we get into this scenario in the first place by having the RoT start reading from its fifo in the middle of a transaction. We therefore added support for checking the `sot` bit on the first read from the fifo to see if this is indeed the first fifo entry being read after `CSn` was asserted. If this is not the case, the RoT immediately declares a desynchronization, waits for `CSn` to actually be de-asserted via gpio read, replies to the SP with the desynchronization error for visibility, and moves on to waiting for the next request.

This strategy was discussed in chat with hubris team members and causes no harm. However, after implementing it and thinking more about this, it seems semantically incorrect. We already have `SprotProtocolError::FlowError`, which indicates overruns. A missing `sot` bit on the first fifo read of a request is actually just a flow control error, except that instead of the missing bytes being in the middle of the request, they are at the beginning. In the common case, this should be detected via the `rxerror` bit, and we should return a `FlowError`. If there is an actual desynchronization, we will detect that after the request when we poll the `CHIP_SELECT` gpio pin. It is entirely possible that the RoT misses the first few bytes of an SP request but is not looking at an `ssd` bit from a prior transaction. Informing the SP that this common flow error is a very rare desynchronization event triggered on sled restarts, and bumping counters, will lead to misleading debugging paths IMO, and we should probably remove the code that uses the `sot` bit before this is merged.

There were some additional changes made for clarity and correctness. Cleanup of fifos, errors, and ssa/ssd bits is now self-contained, and asserting and de-asserting ROT_IRQ happen inside `reply`. I didn't think it was really necessary to optimize for the `sys_irq_control` syscall delay with regards to setting `ROT_IRQ`, given that we have a 16 byte fifo and the SP pauses for 2 ms before reading the rest of a reply larger than 16 bytes. That gives plenty of time for that syscall to complete and for the RoT to wait for the CSn-asserted IRQ wakeup after asserting ROT_IRQ. This change makes the code more linear and removes some unnecessary state. Testing so far has shown no difference in error rate.

On the SP side, `Desynchronization` errors are now recoverable. It should also be noted how adding the new `Desynchronization` error will affect upgrades. It is a new variant of the `SprotProtocolError` enum, and therefore code that is unaware of this variant will not be able to deserialize it. Specifically:

1. Because the RoT is upgraded before the SP in mupdate, the SP code will see a `SprotProtocolError::Deserialization` error in the case of desynchronization. This is already a recoverable error, and the behavior of the SP sprot server should be the same, except that if retries are exceeded for some reason, the wrong error could be plumbed up to control-plane-agent and MGS. This is exceedingly unlikely for this specific error, except perhaps in the flow control case where we use `sot`, described above.
2. Until MGS is updated, if the new error variant gets plumbed upwards it will be seen as an incompatible protocol error. This is not really a big deal in this case, as we are still mupdating and this is the only related error that can occur this way until the system is further upgraded.
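To illustrate the upgrade concern with a deliberately simplified decoder (the discriminant values and the `old_decode` function are made up for this sketch; the real wire format is defined by the sprot protocol serialization, not this code): a decoder built before the new variant existed cannot map the new discriminant to anything, so the failure surfaces as a deserialization error.

```rust
// Simplified model of an old-firmware error decoder that predates a newly
// added enum variant. Discriminants and names are invented for this sketch.
#[derive(Debug, PartialEq)]
enum OldError {
    FlowError,
    Deserialization,
}

/// Decode a wire discriminant using only the variants the old code knows.
/// Any newer discriminant (e.g. a later-added Desynchronized variant)
/// comes back as a deserialization error.
fn old_decode(wire: u8) -> Result<OldError, OldError> {
    match wire {
        0 => Ok(OldError::FlowError),
        1 => Ok(OldError::Deserialization),
        _ => Err(OldError::Deserialization),
    }
}

fn main() {
    // Known variants round-trip fine.
    assert_eq!(old_decode(0), Ok(OldError::FlowError));
    // A hypothetical new variant's discriminant is unknown to the old
    // decoder, so the SP reports Deserialization instead, as described
    // in point 1 above.
    assert_eq!(old_decode(2), Err(OldError::Deserialization));
}
```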
While updating the colo rack with the dogfood image (https://github.com/oxidecomputer/omicron/runs/15999510273), we updated both the SP and RoT images using mupdate / wicket.
(Normally, we skip updating the RoTs during mupdate, because they haven't changed in a while)
Augustus noticed one RoT showing a failure to communicate after the initial update; a second update brought this number to three RoTs.
Here's an example of an unhappy RoT: ...

For contrast, a happy RoT: ...
The ringbuf shows invalid data received by the SP when we try to request the RoT status. We would expect the `RxBuf` messages to begin with `[3, 0, 0, 0]`, because that's the current protocol version serialized with `u32::to_le_bytes()`.

Looking at the `spi_server_core` ringbuf, we do see that pattern. However, it occurs before the `Start` log in the ringbuf! That's weird and surprising. Because we're retrying three times, it's likely from a previous message; unfortunately, the ringbuf isn't long enough to record all three attempts.

This looks like a framing issue where the SP and RoT have gotten out of sync. There are FIFO buffers on each side's SPI peripheral. Resetting the SP did not fix the issue; power-cycling the whole sled (via Ignition) does fix the issue. This leads us to suspect the RoT's Tx FIFO.

(In particular, I'm wondering what happens if the RoT begins listening to SPI midway through an attempted transaction from the SP, since the SP will try talking to it unconditionally.)
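For reference, the expected `[3, 0, 0, 0]` prefix falls directly out of the little-endian serialization described above:

```rust
// The protocol version (3 at the time of this issue) serialized
// little-endian is the 4-byte prefix the SP looks for.
fn main() {
    let version: u32 = 3;
    assert_eq!(version.to_le_bytes(), [3, 0, 0, 0]);
}
```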