-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #1507 #1523
Fix #1507 #1523
Conversation
Before merging, we'll need to merge in oxidecomputer/management-gateway-service#121. However, before we do that, we'll need to deal with oxidecomputer/management-gateway-service#123. |
pub timeouts: u32, | ||
|
||
/// Number of times the RoT has reported that it was desynchronized | ||
pub desynchronized: u32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably increment this somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will require more refactors than I'm willing to make right now on this PR.
I should also note that this code needs testing on real hardware and not the gimletlet/rot-carrier. I'll need to update |
The primary fix here is for the RoT to detect desynchronization between the SP and RoT and then wait until synchronization resumes before proceeding. The RoT and SP are desynchronized if `CSn` is still asserted *after* the RoT sees the `ssd` bit set in the fifo status register during its tight loops while clocking in a request or clocking out a reply to the SP. Synchronization is resumed by the RoT waiting until CSn is actually de-asserted before either replying to the SP or de-asserting ROT_IRQ. Detection of actual `CSn` state is performed by polling the `CHIP_SELECT` gpio pin that corresponds to the `CSn` pin used by the SPI block on the RoT that sets the `ssa` and `ssd` bits. The key problem being solved is that the `ssd` bit is saturating, and so the RoT sprot server may look at this bit and think that the current SPI transaction is done, even though the bit was set during a prior request. The actual `CSn` de-assert for the current request arrives after the RoT reads and clears the `ssd` bit, causing the bit to be set again. The next request from the SP then comes in, the `ssa` interrupt indicating `CSn` asserted fires and the RoT goes into a tight loop, processes a fifo's worth of data and again immediately sees the `ssd` bit set and ends its response early. In our traces we saw that this exposes itself by `ROT_IRQ` de- asserting before `CSn` is actually de-asserted and sprot continues this livelock indefinitely. If after every transaction, the RoT waits for the `CSn` gpio line to actually become de-asserted before we consider the SPI transaction complete, we know that we are operating on a request or reply boundary, and thus the RoT and SP are resynchronized. Our salae traces from #1507 also showed that we get into this scenario in the first place by having the RoT start reading from its fifo in the middle of a transaction. We therefore added support for checking the `sot` bit on the first read from the fifo to see if this is indeed the first fifo entry being read after `CSn` was asserted. If this is not the case the RoT immediately declares a desynchronization, waits for `CSn` to actually be deasserted via gpio read, replies to the SP with the desynchronization error for visibility and moves on to waiting for the next request. This strategy was discussed in chat with hubris team members and causes no harm. However, after implementing and thinking more about this it seems semantically incorrect. We already have `SprotProtocolError::FlowError` which indicates overruns. A missing `sot` bit on the first fifo read of a request is actually just a flow control error, except that instead of the missing bytes being in the middle of the request, they are at the beginning. In the common case, this should be detected via the `rxerror` bit, and we should return a `FlowError`. If there is an actual desynchronization, we will detect that after the request when we poll the `CHIP_SELECT` gpio pin. It is totally possible that the RoT misses the first few bytes of an SP request but is not looking at an `ssd` bit from a prior transaction. Informing the SP that this common flow error is a very rare desynchronization event that is triggered on sled restarts and bumping counters will lead to misleading debugging paths IMO, and we should probably remove the code that uses the `sot` bit before this is merged. There were some additional changes made for clarity and correctness. Cleanup of fifos, errors, and ssa/ssd bits is now self-contained, and asserting and de-asserting ROT_IRQ happen inside `reply`. I didn't think it was really necessary to optimize for the `sys_irq_control` syscall delay with regards to setting `ROT_IRQ` given that we have a 16 byte fifo and then the SP pauses for 2 ms before reading the rest of a reply larger than 16 bytes. That gives plenty of time for that syscall to complete and for the RoT to wait for the CSn asserted IRQ wakeup after asserting ROT_IRQ. This change makes the code more linear and removes some unnecessary state. Testing so far has shown no difference in error rate. On the SP side, `Desynchronization` errors are now recoverable. It should also be noted how adding the new `Desyncrhonization` error will affect upgrades. It is a new varient to the `SprotProtocolError` enum and therfore code that is unaware of this variant will not be able to deserialize it. Specifically: 1. Because the RoT is upgraded before the SP in mupdate this means that the SP code will see a `SprotProtocolError::Deserialization` error in the case of desynchronization. This is already a recoverable error and the behavior of the SP sprot server should be the same, except that if retries are exceeded for some reason, the wrong error could be plumbed up to the control-plane-agent and MGS. This is exceedingly unlikely for this specific error, except for perhaps in the flow control case where we use `sot` described above. 2. Until MGS is updated, if the new error variant gets plumbed upwards it will be seen as an incompatible protocol error. This is not really a big deal in this case, as we are still mupdating and this is the only related error that can occur this way until the system is further upgraded.
b06d9bd
to
c4bfc5f
Compare
I'm not sure about this. Are we sure this is going to be detected via |
let (read, sot) = self.spi.read_u16_with_sot(); | ||
if first_read { | ||
first_read = false; | ||
if !sot { | ||
self.stats.desynchronized = | ||
self.stats.desynchronized.wrapping_add(1); | ||
return Err(IoError::Desynchronized); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we removed this code wouldn't we still run into the issue where SP has started a transmission SSa gets asserted but we don't notice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it would be detected when trying to reply because CSn would still be asserted while SSD is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, after thinking more about this, I withdraw most of my concern. The code works as is, and I haven't actually seen this hit during upgrades even though I've seen a handful of rxerrors(which is on par with what our current error rates are).
Interesting. I thought all dropped bytes were detected by rxerror, except perhaps on startup before we initialize spi or because we mistakenly clear the error. I didn't really think about the case that the SOT bit would still be set even if early bytes were dropped. That seems quite problematic for its usage here. The problem we are solving for is missing initial bytes. |
The repro test script has been running on my gimletlet/rot-carrier for over 3 hours now with no reproductions. I also just installed a this image on a sled on madrid via faux-mgs and it installed fine and state was accessible. |
The script I am running to reproduce the issue is: #!/usr/bin/env bash
MGS_REPO=${MGS_REPO:=~/oxide/management-gateway-service}
HUBRIS_REPO=${HUBRIS_REPO:=~/oxide/hubris}
HUMILITY_ENVIRONMENT=${HUMILITY_ENVIRONMENT:=~/Users/ajs/oxide/humility_env.json}
HENV=$HUMILITY_ENVIRONMENT
rota_reset='humility -t rot-carrier --archive-name a reset'
ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}
# Number of times in a row `faux-mgs state` must fail to be a test success
MAX_FAILS=${MAX_FAILS:=110}
ready_to_reset=false
failures_in_a_row=0
# We want to wait for a few times through the loop to make sure the reset
# command completes before trying again. Otherwise, we will get an error:
#
# humility reset failed: Probe could not be created
#
# Caused by:
# 0: hidapi error: hid_error is not implemented yet
# 1: hidapi error: hid_error is not implemented yet
#
MAX_GRACE_CYCLES=20
reset_grace=$MAX_GRACE_CYCLES
while true; do
# Run our command
output=`$MGS_REPO/target/release/faux-mgs --interface en6 --log-level critical state`
echo "$output" | grep -q "RoT state: Ok"
if [[ $? -eq 0 ]]; then
failures_in_a_row=0
echo "."
else
failures_in_a_row=$((failures_in_a_row+1))
echo "+"
fi
# increment our grace period counter
reset_grace=$((reset_grace+1))
# Are we done?
if [[ failures_in_a_row -eq MAX_FAILS ]]; then
echo "Test Complete: found faulty sprot"
break
fi
# Should we reset the RoT to try again?
choice=$RANDOM
sleep=$((1/(choice % 100 + 1)))
if [[ $failures_in_a_row -eq 0 && $reset_grace -gt $MAX_GRACE_CYCLES ]]; then
# reset our grace period counter
reset_grace=0
# Flip a coin to see if we should sleep before we try to reset or after
# We right shift to create a boolean
let "choice >>= 14"
if [[ $choice -eq 0 ]]; then
HUMILITY_ENV=$HENV bash -c "sleep ${sleep} && ${ROT_RESET_COMMAND} &" &
else
HUMILITY_ENV=$HENV bash -c "${ROT_RESET_COMMAND} &" &
fi
else
sleep 0.01
fi
done |
first_read = false; | ||
if !sot { | ||
self.stats.desynchronized = | ||
self.stats.desynchronized.wrapping_add(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe saturating_add
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this match all the other counters. I specifically did that because it seems more useful to know if a counter is moving than if it saturated at one point in time. Although, in either case we shouldn't really come close to hitting this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I don't feel strongly about this!
drv/lpc55-sprot-server/src/main.rs
Outdated
self.stats.desynchronized = | ||
self.stats.desynchronized.wrapping_add(1); | ||
} | ||
result = Err(DesynchronizedError); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe move this into the conditional, to make it more obvious that the conditional is only executed on the first transition from Ok
-> Err
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good with the few changes suggested. Given how there's almost no debugging options if this link is down are there any other options for reporting an error or making it possible to get state out? We can file as a tracking item if necessary.
drv/lpc55-sprot-server/src/main.rs
Outdated
ringbuf_entry!(Trace::Stats(io.stats)); | ||
io.reply(&tx_buf[..rsp_len]); | ||
} | ||
} | ||
|
||
struct DesynchronizedError; | ||
impl Io { | ||
// Wait for chip select to be asserted | ||
// Assert ROT_IRQ if this is a reply |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this comment if we're not doing this part anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Begone!
drv/lpc55-sprot-server/src/main.rs
Outdated
} | ||
|
||
// Reply to the SP | ||
// Our fifos are guaranteed to be empty at this point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way we could add an assert/error for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. There's also always an inherent TOCTUO here. We could check the fifos, but they may not be empty right after the check.
If they aren't empty we'll end up with some sort of error here or at the SP. I'll be more detailed in the comment.
I'm not sure how we would do that. I think we would have to raise the fault pin and then reset the RoT. |
The primary fix here is for the RoT to detect desynchronization between
the SP and RoT and then wait until synchronization resumes before
proceeding. The RoT and SP are desynchronized if
CSn
is still assertedafter the RoT sees the
ssd
bit set in the fifo status registerduring its tight loops while clocking in a request or clocking out a
reply to the SP. Synchronization is resumed by the RoT waiting until CSn
is actually de-asserted before either replying to the SP or de-asserting
ROT_IRQ. Detection of actual
CSn
state is performed by polling theCHIP_SELECT
gpio pin that corresponds to theCSn
pin used by the SPIblock on the RoT that sets the
ssa
andssd
bits.The key problem being solved is that the
ssd
bit is saturating, and sothe RoT sprot server may look at this bit and think that the current SPI
transaction is done, even though the bit was set during a prior request.
The actual
CSn
de-assert for the current request arrives after theRoT reads and clears the
ssd
bit, causing the bit to be set again. Thenext request from the SP then comes in, the
ssa
interrupt indicatingCSn
asserted fires and the RoT goes into a tight loop, processes afifo's worth of data and again immediately sees the
ssd
bit set andends its response early. In our traces we saw that this exposes itself
by
ROT_IRQ
de- asserting beforeCSn
is actually de-asserted andsprot continues this livelock indefinitely. If after every transaction,
the RoT waits for the
CSn
gpio line to actually become de-assertedbefore we consider the SPI transaction complete, we know that we are
operating on a request or reply boundary, and thus the RoT and SP are
resynchronized.
Our salae traces from #1507 also showed that we get into this scenario
in the first place by having the RoT start reading from its fifo in
the middle of a transaction. We therefore added support for checking
the
sot
bit on the first read from the fifo to see if this is indeedthe first fifo entry being read after
CSn
was asserted. If this isnot the case the RoT immediately declares a desynchronization, waits
for
CSn
to actually be deasserted via gpio read, replies to theSP with the desynchronization error for visibility and moves on to
waiting for the next request. This strategy was discussed in chat with
hubris team members and causes no harm. However, after implementing
and thinking more about this it seems semantically incorrect. We
already have
SprotProtocolError::FlowError
which indicates overruns.A missing
sot
bit on the first fifo read of a request is actuallyjust a flow control error, except that instead of the missing bytes
being in the middle of the request, they are at the beginning. In the
common case, this should be detected via the
rxerror
bit, and weshould return a
FlowError
. If there is an actual desynchronization,we will detect that after the request when we poll the
CHIP_SELECT
gpio pin. It is totally possible that the RoT misses the first few
bytes of an SP request but is not looking at an
ssd
bit from a priortransaction. Informing the SP that this common flow error is a very rare
desynchronization event that is triggered on sled restarts and bumping
counters will lead to misleading debugging paths IMO, and we should
probably remove the code that uses the
sot
bit before this is merged.There were some additional changes made for clarity and correctness.
Cleanup of fifos, errors, and ssa/ssd bits is now self-contained, and
asserting and de-asserting ROT_IRQ happen inside
reply
. I didn't thinkit was really necessary to optimize for the
sys_irq_control
syscalldelay with regards to setting
ROT_IRQ
given that we have a 16 bytefifo and then the SP pauses for 2 ms before reading the rest of a reply
larger than 16 bytes. That gives plenty of time for that syscall to
complete and for the RoT to wait for the CSn asserted IRQ wakeup after
asserting ROT_IRQ. This change makes the code more linear and removes
some unnecessary state. Testing so far has shown no difference in error
rate.
On the SP side,
Desynchronization
errors are now recoverable.It should also be noted how adding the new
Desyncrhonization
errorwill affect upgrades. It is a new varient to the
SprotProtocolError
enum and therfore code that is unaware of this variant will not be
able to deserialize it. Specifically:
Because the RoT is upgraded before the SP in mupdate this means that
the SP code will see a
SprotProtocolError::Deserialization
error inthe case of desynchronization. This is already a recoverable error and
the behavior of the SP sprot server should be the same, except that if
retries are exceeded for some reason, the wrong error could be plumbed
up to the control-plane-agent and MGS. This is exceedingly unlikely for
this specific error, except for perhaps in the flow control case where
we use
sot
described above.Until MGS is updated, if the new error variant gets plumbed upwards
it will be seen as an incompatible protocol error. This is not really
a big deal in this case, as we are still mupdating and this is the
only related error that can occur this way until the system is further
upgraded.