Fix #1507 #1523

andrewjstone · 2023-09-08T08:05:35Z

The primary fix here is for the RoT to detect desynchronization between
the SP and RoT and then wait until synchronization resumes before
proceeding. The RoT and SP are desynchronized if CSn is still asserted
after the RoT sees the ssd bit set in the fifo status register
during its tight loops while clocking in a request or clocking out a
reply to the SP. Synchronization is resumed by the RoT waiting until CSn
is actually de-asserted before either replying to the SP or de-asserting
ROT_IRQ. Detection of actual CSn state is performed by polling the
CHIP_SELECT gpio pin that corresponds to the CSn pin used by the SPI
block on the RoT that sets the ssa and ssd bits.

The key problem being solved is that the ssd bit is saturating, and so
the RoT sprot server may look at this bit and think that the current SPI
transaction is done, even though the bit was set during a prior request.
The actual CSn de-assert for the current request arrives after the
RoT reads and clears the ssd bit, causing the bit to be set again. The
next request from the SP then comes in, the ssa interrupt indicating
CSn asserted fires and the RoT goes into a tight loop, processes a
fifo's worth of data and again immediately sees the ssd bit set and
ends its response early. In our traces we saw that this exposes itself
by ROT_IRQ de- asserting before CSn is actually de-asserted and
sprot continues this livelock indefinitely. If after every transaction,
the RoT waits for the CSn gpio line to actually become de-asserted
before we consider the SPI transaction complete, we know that we are
operating on a request or reply boundary, and thus the RoT and SP are
resynchronized.

Our salae traces from #1507 also showed that we get into this scenario
in the first place by having the RoT start reading from its fifo in
the middle of a transaction. We therefore added support for checking
the sot bit on the first read from the fifo to see if this is indeed
the first fifo entry being read after CSn was asserted. If this is
not the case the RoT immediately declares a desynchronization, waits
for CSn to actually be deasserted via gpio read, replies to the
SP with the desynchronization error for visibility and moves on to
waiting for the next request. This strategy was discussed in chat with
hubris team members and causes no harm. However, after implementing
and thinking more about this it seems semantically incorrect. We
already have SprotProtocolError::FlowError which indicates overruns.
A missing sot bit on the first fifo read of a request is actually
just a flow control error, except that instead of the missing bytes
being in the middle of the request, they are at the beginning. In the
common case, this should be detected via the rxerror bit, and we
should return a FlowError. If there is an actual desynchronization,
we will detect that after the request when we poll the CHIP_SELECT
gpio pin. It is totally possible that the RoT misses the first few
bytes of an SP request but is not looking at an ssd bit from a prior
transaction. Informing the SP that this common flow error is a very rare
desynchronization event that is triggered on sled restarts and bumping
counters will lead to misleading debugging paths IMO, and we should
probably remove the code that uses the sot bit before this is merged.

There were some additional changes made for clarity and correctness.
Cleanup of fifos, errors, and ssa/ssd bits is now self-contained, and
asserting and de-asserting ROT_IRQ happen inside reply. I didn't think
it was really necessary to optimize for the sys_irq_control syscall
delay with regards to setting ROT_IRQ given that we have a 16 byte
fifo and then the SP pauses for 2 ms before reading the rest of a reply
larger than 16 bytes. That gives plenty of time for that syscall to
complete and for the RoT to wait for the CSn asserted IRQ wakeup after
asserting ROT_IRQ. This change makes the code more linear and removes
some unnecessary state. Testing so far has shown no difference in error
rate.

On the SP side, Desynchronization errors are now recoverable.

It should also be noted how adding the new Desyncrhonization error
will affect upgrades. It is a new varient to the SprotProtocolError
enum and therfore code that is unaware of this variant will not be
able to deserialize it. Specifically:

Because the RoT is upgraded before the SP in mupdate this means that
the SP code will see a SprotProtocolError::Deserialization error in
the case of desynchronization. This is already a recoverable error and
the behavior of the SP sprot server should be the same, except that if
retries are exceeded for some reason, the wrong error could be plumbed
up to the control-plane-agent and MGS. This is exceedingly unlikely for
this specific error, except for perhaps in the flow control case where
we use sot described above.
Until MGS is updated, if the new error variant gets plumbed upwards
it will be seen as an incompatible protocol error. This is not really
a big deal in this case, as we are still mupdating and this is the
only related error that can occur this way until the system is further
upgraded.

andrewjstone · 2023-09-08T08:07:05Z

Before merging, we'll need to merge in oxidecomputer/management-gateway-service#121. However, before we do that, we'll need to deal with oxidecomputer/management-gateway-service#123.

andrewjstone · 2023-09-08T08:13:55Z

drv/sprot-api/src/lib.rs

    pub timeouts: u32,
+
+    /// Number of times the RoT has reported that it was desynchronized
+    pub desynchronized: u32,


We should probably increment this somewhere.

I think this will require more refactors than I'm willing to make right now on this PR.

andrewjstone · 2023-09-08T08:18:42Z

I should also note that this code needs testing on real hardware and not the gimletlet/rot-carrier. I'll need to update app/oxide-rot-1/app.toml to do this. The image will also have to be signed by non-bartholemew keys I believe.

The primary fix here is for the RoT to detect desynchronization between the SP and RoT and then wait until synchronization resumes before proceeding. The RoT and SP are desynchronized if `CSn` is still asserted *after* the RoT sees the `ssd` bit set in the fifo status register during its tight loops while clocking in a request or clocking out a reply to the SP. Synchronization is resumed by the RoT waiting until CSn is actually de-asserted before either replying to the SP or de-asserting ROT_IRQ. Detection of actual `CSn` state is performed by polling the `CHIP_SELECT` gpio pin that corresponds to the `CSn` pin used by the SPI block on the RoT that sets the `ssa` and `ssd` bits. The key problem being solved is that the `ssd` bit is saturating, and so the RoT sprot server may look at this bit and think that the current SPI transaction is done, even though the bit was set during a prior request. The actual `CSn` de-assert for the current request arrives after the RoT reads and clears the `ssd` bit, causing the bit to be set again. The next request from the SP then comes in, the `ssa` interrupt indicating `CSn` asserted fires and the RoT goes into a tight loop, processes a fifo's worth of data and again immediately sees the `ssd` bit set and ends its response early. In our traces we saw that this exposes itself by `ROT_IRQ` de- asserting before `CSn` is actually de-asserted and sprot continues this livelock indefinitely. If after every transaction, the RoT waits for the `CSn` gpio line to actually become de-asserted before we consider the SPI transaction complete, we know that we are operating on a request or reply boundary, and thus the RoT and SP are resynchronized. Our salae traces from #1507 also showed that we get into this scenario in the first place by having the RoT start reading from its fifo in the middle of a transaction. We therefore added support for checking the `sot` bit on the first read from the fifo to see if this is indeed the first fifo entry being read after `CSn` was asserted. If this is not the case the RoT immediately declares a desynchronization, waits for `CSn` to actually be deasserted via gpio read, replies to the SP with the desynchronization error for visibility and moves on to waiting for the next request. This strategy was discussed in chat with hubris team members and causes no harm. However, after implementing and thinking more about this it seems semantically incorrect. We already have `SprotProtocolError::FlowError` which indicates overruns. A missing `sot` bit on the first fifo read of a request is actually just a flow control error, except that instead of the missing bytes being in the middle of the request, they are at the beginning. In the common case, this should be detected via the `rxerror` bit, and we should return a `FlowError`. If there is an actual desynchronization, we will detect that after the request when we poll the `CHIP_SELECT` gpio pin. It is totally possible that the RoT misses the first few bytes of an SP request but is not looking at an `ssd` bit from a prior transaction. Informing the SP that this common flow error is a very rare desynchronization event that is triggered on sled restarts and bumping counters will lead to misleading debugging paths IMO, and we should probably remove the code that uses the `sot` bit before this is merged. There were some additional changes made for clarity and correctness. Cleanup of fifos, errors, and ssa/ssd bits is now self-contained, and asserting and de-asserting ROT_IRQ happen inside `reply`. I didn't think it was really necessary to optimize for the `sys_irq_control` syscall delay with regards to setting `ROT_IRQ` given that we have a 16 byte fifo and then the SP pauses for 2 ms before reading the rest of a reply larger than 16 bytes. That gives plenty of time for that syscall to complete and for the RoT to wait for the CSn asserted IRQ wakeup after asserting ROT_IRQ. This change makes the code more linear and removes some unnecessary state. Testing so far has shown no difference in error rate. On the SP side, `Desynchronization` errors are now recoverable. It should also be noted how adding the new `Desyncrhonization` error will affect upgrades. It is a new varient to the `SprotProtocolError` enum and therfore code that is unaware of this variant will not be able to deserialize it. Specifically: 1. Because the RoT is upgraded before the SP in mupdate this means that the SP code will see a `SprotProtocolError::Deserialization` error in the case of desynchronization. This is already a recoverable error and the behavior of the SP sprot server should be the same, except that if retries are exceeded for some reason, the wrong error could be plumbed up to the control-plane-agent and MGS. This is exceedingly unlikely for this specific error, except for perhaps in the flow control case where we use `sot` described above. 2. Until MGS is updated, if the new error variant gets plumbed upwards it will be seen as an incompatible protocol error. This is not really a big deal in this case, as we are still mupdating and this is the only related error that can occur this way until the system is further upgraded.

labbott · 2023-09-08T13:40:11Z

However, after implementing
and thinking more about this it seems semantically incorrect. We
already have SprotProtocolError::FlowError which indicates overruns.
A missing sot bit on the first fifo read of a request is actually
just a flow control error, except that instead of the missing bytes
being in the middle of the request, they are at the beginning. In the
common case, this should be detected via the rxerror bit, and we
should return a FlowError

I'm not sure about this. Are we sure this is going to be detected via rxerror? If bytes are being dropped I would think that the hardware would still see it as the start of transfer and still set the bit.

labbott · 2023-09-08T13:51:28Z

drv/lpc55-sprot-server/src/main.rs

+                let (read, sot) = self.spi.read_u16_with_sot();
+                if first_read {
+                    first_read = false;
+                    if !sot {
+                        self.stats.desynchronized =
+                            self.stats.desynchronized.wrapping_add(1);
+                        return Err(IoError::Desynchronized);
+                    }
+                }


If we removed this code wouldn't we still run into the issue where SP has started a transmission SSa gets asserted but we don't notice?

Yes, but it would be detected when trying to reply because CSn would still be asserted while SSD is set.

So, after thinking more about this, I withdraw most of my concern. The code works as is, and I haven't actually seen this hit during upgrades even though I've seen a handful of rxerrors(which is on par with what our current error rates are).

andrewjstone · 2023-09-11T16:25:09Z

However, after implementing
and thinking more about this it seems semantically incorrect. We
already have SprotProtocolError::FlowError which indicates overruns.
A missing sot bit on the first fifo read of a request is actually
just a flow control error, except that instead of the missing bytes
being in the middle of the request, they are at the beginning. In the
common case, this should be detected via the rxerror bit, and we
should return a FlowError

I'm not sure about this. Are we sure this is going to be detected via rxerror? If bytes are being dropped I would think that the hardware would still see it as the start of transfer and still set the bit.

Interesting. I thought all dropped bytes were detected by rxerror, except perhaps on startup before we initialize spi or because we mistakenly clear the error. I didn't really think about the case that the SOT bit would still be set even if early bytes were dropped. That seems quite problematic for its usage here. The problem we are solving for is missing initial bytes.

andrewjstone · 2023-09-12T00:09:03Z

The repro test script has been running on my gimletlet/rot-carrier for over 3 hours now with no reproductions.

I also just installed a this image on a sled on madrid via faux-mgs and it installed fine and state was accessible.

andrewjstone · 2023-09-12T00:10:12Z

The script I am running to reproduce the issue is:

#!/usr/bin/env bash


MGS_REPO=${MGS_REPO:=~/oxide/management-gateway-service}
HUBRIS_REPO=${HUBRIS_REPO:=~/oxide/hubris}
HUMILITY_ENVIRONMENT=${HUMILITY_ENVIRONMENT:=~/Users/ajs/oxide/humility_env.json}

HENV=$HUMILITY_ENVIRONMENT
rota_reset='humility -t rot-carrier --archive-name a reset'
ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}

# Number of times in a row `faux-mgs state` must fail to be a test success
MAX_FAILS=${MAX_FAILS:=110}


ready_to_reset=false
failures_in_a_row=0

# We want to wait for a few times through the loop to make sure the reset
# command completes before trying again. Otherwise, we will get an error:
#
#   humility reset failed: Probe could not be created
#
#  Caused by:
#      0: hidapi error: hid_error is not implemented yet
#  	    1: hidapi error: hid_error is not implemented yet
#
MAX_GRACE_CYCLES=20
reset_grace=$MAX_GRACE_CYCLES

while true; do
  # Run our command
  output=`$MGS_REPO/target/release/faux-mgs --interface en6 --log-level critical state`
  echo "$output" | grep -q "RoT state: Ok"
  if [[ $? -eq 0 ]]; then
    failures_in_a_row=0
    echo "."
  else
    failures_in_a_row=$((failures_in_a_row+1))
    echo "+"
  fi

  # increment our grace period counter
  reset_grace=$((reset_grace+1))

  # Are we done?
  if [[ failures_in_a_row -eq MAX_FAILS ]]; then
    echo "Test Complete: found faulty sprot"
    break
  fi

  # Should we reset the RoT to try again?
  choice=$RANDOM
  sleep=$((1/(choice % 100 + 1)))
  if [[ $failures_in_a_row -eq 0 && $reset_grace -gt $MAX_GRACE_CYCLES ]]; then
    # reset our grace period counter
    reset_grace=0
    # Flip a coin to see if we should sleep before we try to reset or after
    # We right shift to create a boolean
    let "choice >>= 14"
    if [[ $choice -eq 0 ]]; then
      HUMILITY_ENV=$HENV bash -c "sleep ${sleep} && ${ROT_RESET_COMMAND} &" &
    else 
      HUMILITY_ENV=$HENV bash -c "${ROT_RESET_COMMAND} &" &
    fi 
  else
    sleep 0.01
  fi
done

mkeeter · 2023-09-12T20:22:37Z

drv/lpc55-sprot-server/src/main.rs

+                    first_read = false;
+                    if !sot {
+                        self.stats.desynchronized =
+                            self.stats.desynchronized.wrapping_add(1);


Maybe saturating_add instead?

I made this match all the other counters. I specifically did that because it seems more useful to know if a counter is moving than if it saturated at one point in time. Although, in either case we shouldn't really come close to hitting this.

Okay, I don't feel strongly about this!

mkeeter · 2023-09-12T20:24:33Z

drv/lpc55-sprot-server/src/main.rs

+                self.stats.desynchronized =
+                    self.stats.desynchronized.wrapping_add(1);
+            }
+            result = Err(DesynchronizedError);


Nit: maybe move this into the conditional, to make it more obvious that the conditional is only executed on the first transition from Ok -> Err?

labbott

I think this looks good with the few changes suggested. Given how there's almost no debugging options if this link is down are there any other options for reporting an error or making it possible to get state out? We can file as a tracking item if necessary.

labbott · 2023-09-13T18:47:12Z

drv/lpc55-sprot-server/src/main.rs

        ringbuf_entry!(Trace::Stats(io.stats));
        io.reply(&tx_buf[..rsp_len]);
    }
 }

+struct DesynchronizedError;
 impl Io {
    // Wait for chip select to be asserted
    // Assert ROT_IRQ if this is a reply


Remove this comment if we're not doing this part anymore

Good catch. Begone!

labbott · 2023-09-13T18:52:25Z

drv/lpc55-sprot-server/src/main.rs

+    }
+
+    // Reply to the SP
+    // Our fifos are guaranteed to be empty at this point


Is there any way we could add an assert/error for this?

Not really. There's also always an inherent TOCTUO here. We could check the fifos, but they may not be empty right after the check.

If they aren't empty we'll end up with some sort of error here or at the SP. I'll be more detailed in the comment.

andrewjstone · 2023-09-14T19:50:49Z

I think this looks good with the few changes suggested. Given how there's almost no debugging options if this link is down are there any other options for reporting an error or making it possible to get state out? We can file as a tracking item if necessary.

I'm not sure how we would do that. I think we would have to raise the fault pin and then reset the RoT.

andrewjstone requested review from mkeeter, lzrd and labbott September 8, 2023 08:05

andrewjstone mentioned this pull request Sep 8, 2023

Fix #1507 #1519

Closed

andrewjstone commented Sep 8, 2023

View reviewed changes

andrewjstone force-pushed the fix-1507 branch from b06d9bd to c4bfc5f Compare September 8, 2023 09:08

labbott reviewed Sep 8, 2023

View reviewed changes

andrewjstone added 4 commits September 11, 2023 14:13

Move back onto MGS main branch

385ecfa

Merge branch 'master' into fix-1507

2b1a041

Bump MGS

9143b69

Add a pin name and some flash for oxide-rot-1

88c50fc

andrewjstone requested review from cbiffle and flihp as code owners September 11, 2023 18:54

Add a pin name and some flash for oxide-rot-1/app-dev.toml

9c5ee5d

mkeeter reviewed Sep 12, 2023

View reviewed changes

labbott reviewed Sep 13, 2023

View reviewed changes

mkeeter approved these changes Sep 14, 2023

View reviewed changes

andrewjstone added 3 commits September 14, 2023 15:52

Fixes from review

35c0357

Merge branch 'master' into fix-1507

5581bd9

Bump flash sizes for sprot

c126696

labbott approved these changes Sep 14, 2023

View reviewed changes

andrewjstone merged commit de395a8 into master Sep 14, 2023
71 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #1507 #1523

Fix #1507 #1523

andrewjstone commented Sep 8, 2023 •

edited

Loading

andrewjstone commented Sep 8, 2023

andrewjstone Sep 8, 2023

andrewjstone Sep 14, 2023

andrewjstone commented Sep 8, 2023

labbott commented Sep 8, 2023

labbott Sep 8, 2023

andrewjstone Sep 11, 2023

andrewjstone Sep 14, 2023

andrewjstone commented Sep 11, 2023

andrewjstone commented Sep 12, 2023

andrewjstone commented Sep 12, 2023

mkeeter Sep 12, 2023

andrewjstone Sep 12, 2023

mkeeter Sep 12, 2023

mkeeter Sep 12, 2023

andrewjstone Sep 12, 2023

labbott left a comment

labbott Sep 13, 2023

andrewjstone Sep 14, 2023

labbott Sep 13, 2023

andrewjstone Sep 14, 2023

andrewjstone commented Sep 14, 2023

Fix #1507 #1523

Fix #1507 #1523

Conversation

andrewjstone commented Sep 8, 2023 • edited Loading

andrewjstone commented Sep 8, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewjstone commented Sep 8, 2023

labbott commented Sep 8, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewjstone commented Sep 11, 2023

andrewjstone commented Sep 12, 2023

andrewjstone commented Sep 12, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

labbott left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewjstone commented Sep 14, 2023

andrewjstone commented Sep 8, 2023 •

edited

Loading