Skip to content

Commit

Permalink
Fix #1507
Browse files Browse the repository at this point in the history
The primary fix here is for the RoT to detect desynchronization between
the SP and RoT and then wait until synchronization resumes before
proceeding. The RoT and SP are desynchronized if `CSn` is still asserted
*after* the RoT sees the `ssd` bit set in the fifo status register
during its tight loops while clocking in a request or clocking out a
reply to the SP. Synchronization is resumed by the RoT waiting until CSn
is actually de-asserted before either replying to the SP or de-asserting
ROT_IRQ. Detection of actual `CSn` state is performed by polling the
`CHIP_SELECT` gpio pin that corresponds to the `CSn` pin used by the SPI
block on the RoT that sets the `ssa` and `ssd` bits.

The key problem being solved is that the `ssd` bit is saturating, and so
the RoT sprot server may look at this bit and think that the current SPI
transaction is done, even though the bit was set during a prior request.
The actual `CSn` de-assert for the current request arrives after the
RoT reads and clears the `ssd` bit, causing the bit to be set again. The
next request from the SP then comes in, the `ssa` interrupt indicating
`CSn` asserted fires and the RoT goes into a tight loop, processes a
fifo's worth of data and again immediately sees the `ssd` bit set and
ends its response early. In our traces we saw that this exposes itself
by `ROT_IRQ` de- asserting before `CSn` is actually de-asserted and
sprot continues this livelock indefinitely. If after every transaction,
the RoT waits for the `CSn` gpio line to actually become de-asserted
before we consider the SPI transaction complete, we know that we are
operating on a request or reply boundary, and thus the RoT and SP are
resynchronized.

Our salae traces from #1507 also showed that we get into this scenario
in the first place by having the RoT start reading from its fifo in
the middle of a transaction. We therefore added support for checking
the `sot` bit on the first read from the fifo to see if this is indeed
the first fifo entry being read after `CSn` was asserted. If this is
not the case the RoT immediately declares a desynchronization, waits
for `CSn` to actually be  deasserted via gpio read, replies to the
SP with the desynchronization error for visibility and moves on to
waiting for the next request. This strategy was discussed in chat with
hubris team members and causes no harm. However, after implementing
and thinking more about this it seems semantically incorrect. We
already have `SprotProtocolError::FlowError` which indicates overruns.
A missing `sot` bit on the first fifo read of a request is actually
just a flow control error, except that instead of the missing bytes
being in the middle of the request, they are at the beginning. In the
common case, this should be detected via the `rxerror` bit, and we
should return a `FlowError`. If there is an actual desynchronization,
we will detect that after the request when we poll the `CHIP_SELECT`
gpio pin. It is totally possible that the RoT misses the first few
bytes of an SP request but is not looking at an `ssd` bit from a prior
transaction. Informing the SP that this common flow error is a very rare
desynchronization event that is triggered on sled restarts and bumping
counters will lead to misleading debugging paths IMO, and we should
probably remove the code that uses the `sot` bit before this is merged.

There were some additional changes made for clarity and correctness.
Cleanup of fifos, errors, and ssa/ssd bits is now self-contained, and
asserting and de-asserting ROT_IRQ happen inside `reply`. I didn't think
it was really necessary to optimize for the `sys_irq_control` syscall
delay with regards to setting `ROT_IRQ` given that we have a 16 byte
fifo and then the SP pauses for 2 ms before reading the rest of a reply
larger than 16 bytes. That gives plenty of time for that syscall to
complete and for the RoT to wait for the CSn asserted IRQ wakeup after
asserting ROT_IRQ. This change makes the code more linear and removes
some unnecessary state. Testing so far has shown no difference in error
rate.

On the SP side, `Desynchronization` errors are now recoverable.

It should also be noted how adding the new `Desyncrhonization` error
will affect upgrades. It is a new varient to the `SprotProtocolError`
enum and therfore code that is unaware of this variant will not be
able to deserialize it. Specifically:

 1. Because the RoT is upgraded before the SP in mupdate this means that
the SP code will see a `SprotProtocolError::Deserialization` error in
the case of desynchronization. This is already a recoverable error and
the behavior of the SP sprot server should be the same, except that if
retries are exceeded for some reason, the wrong error could be plumbed
up to the control-plane-agent and MGS. This is exceedingly unlikely for
this specific error, except for perhaps in the flow control case where
we use `sot` described above.

 2. Until MGS is updated, if the new error variant gets plumbed upwards
it will be seen as an incompatible protocol error. This is not really
a big deal in this case, as we are still mupdating and this is the
only related error that can occur this way until the system is further
upgraded.
  • Loading branch information
andrewjstone committed Sep 8, 2023
1 parent 09c9708 commit c4bfc5f
Show file tree
Hide file tree
Showing 8 changed files with 217 additions and 125 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 2 additions & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -116,7 +116,8 @@ zip = { version = "0.6", default-features = false, features = ["bzip2"] }

# Oxide forks and repos
dice-mfg-msgs = { git = "https://github.com/oxidecomputer/dice-util", default-features = false, version = "0.2.1" }
gateway-messages = { git = "https://github.com/oxidecomputer/management-gateway-service", default-features = false, features = ["smoltcp"] }
#gateway-messages = { git = "https://github.com/oxidecomputer/management-gateway-service", default-features = false, features = ["smoltcp"] }
gateway-messages = { git = "https://github.com/oxidecomputer/management-gateway-service", default-features = false, features = ["smoltcp"], branch = "sprot-desync-error" }
hif = { git = "https://github.com/oxidecomputer/hif", default-features = false }
humpty = { git = "https://github.com/oxidecomputer/humpty", default-features = false, version = "0.1.3" }
hubtools = { git = "https://github.com/oxidecomputer/hubtools", default-features = false, version = "0.4.1" }
Expand Down
4 changes: 2 additions & 2 deletions app/rot-carrier/app.toml
Original file line number Diff line number Diff line change
Expand Up @@ -107,7 +107,7 @@ task-slots = ["syscon_driver"]
[tasks.sprot]
name = "drv-lpc55-sprot-server"
priority = 6
max-sizes = {flash = 44384, ram = 32768}
max-sizes = {flash = 45384, ram = 32768}
uses = ["flexcomm8", "bootrom"]
features = ["spi0"]
start = true
Expand All @@ -125,7 +125,7 @@ pins = [
# HS_SPI_SCK = P1_2 = FUN6
{ pin = { port = 1, pin = 2}, alt = 6},
# HS_SSEL1 = P1_1 = FUN5
{ pin = { port = 1, pin = 1}, alt = 5},
{ name = "CHIP_SELECT", pin = { port = 1, pin = 1}, alt = 5},
# ROT_IRQ = P0_18 = FUN0
{ name = "ROT_IRQ", pin = { port = 0, pin = 18}, alt = 0, direction = "output"},
# SP_RESET = P0_9 = FUN0
Expand Down
9 changes: 7 additions & 2 deletions drv/lpc55-spi/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -179,7 +179,7 @@ impl Spi {
}

/// Clear Slave Select Asserted interrupt
pub fn ssa_clear(&mut self) {
pub fn ssa_clear(&self) {
self.reg.stat.write(|w| w.ssa().set_bit());
}

Expand Down Expand Up @@ -248,7 +248,7 @@ impl Spi {
/// but that would leave us in a weird place where we wouldn't know what state
/// the fifo was in. Since FIFOWR is write-only, we just set the frame size
/// on each wite and ensure we pair writes and reads of the same width.
pub fn send_u16(&mut self, entry: u16) {
pub fn send_u16(&self, entry: u16) {
self.reg.fifowr.write(|w| unsafe {
w.len()
// 0xF = Data transfer is 16 bits in length.
Expand Down Expand Up @@ -340,4 +340,9 @@ impl Spi {
pub fn read_u16(&mut self) -> u16 {
self.reg.fiford.read().rxdata().bits() as u16
}

pub fn read_u16_with_sot(&self) -> (u16, bool) {
let reader = self.reg.fiford.read();
(reader.rxdata().bits() as u16, reader.sot().bit_is_set())
}
}
9 changes: 8 additions & 1 deletion drv/lpc55-sprot-server/src/handler.rs
Original file line number Diff line number Diff line change
Expand Up @@ -69,12 +69,19 @@ impl Handler {
}
}

/// Serialize and return a `SprotError::FlowError`
pub fn flow_error(&self, tx_buf: &mut [u8; RESPONSE_BUF_SIZE]) -> usize {
let body = Err(SprotProtocolError::FlowError.into());
Response::pack(&body, tx_buf)
}

pub fn desynchronized_error(
&self,
tx_buf: &mut [u8; RESPONSE_BUF_SIZE],
) -> usize {
let body = Err(SprotProtocolError::Desynchronized.into());
Response::pack(&body, tx_buf)
}

pub fn handle(
&mut self,
rx_buf: &[u8],
Expand Down
Loading

0 comments on commit c4bfc5f

Please sign in to comment.