Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1507 #1523

Merged
merged 9 commits into from
Sep 14, 2023
Merged

Fix #1507 #1523

merged 9 commits into from
Sep 14, 2023

Conversation

andrewjstone
Copy link
Contributor

@andrewjstone andrewjstone commented Sep 8, 2023

The primary fix here is for the RoT to detect desynchronization between
the SP and RoT and then wait until synchronization resumes before
proceeding. The RoT and SP are desynchronized if CSn is still asserted
after the RoT sees the ssd bit set in the fifo status register
during its tight loops while clocking in a request or clocking out a
reply to the SP. Synchronization is resumed by the RoT waiting until CSn
is actually de-asserted before either replying to the SP or de-asserting
ROT_IRQ. Detection of actual CSn state is performed by polling the
CHIP_SELECT gpio pin that corresponds to the CSn pin used by the SPI
block on the RoT that sets the ssa and ssd bits.

The key problem being solved is that the ssd bit is saturating, and so
the RoT sprot server may look at this bit and think that the current SPI
transaction is done, even though the bit was set during a prior request.
The actual CSn de-assert for the current request arrives after the
RoT reads and clears the ssd bit, causing the bit to be set again. The
next request from the SP then comes in, the ssa interrupt indicating
CSn asserted fires and the RoT goes into a tight loop, processes a
fifo's worth of data and again immediately sees the ssd bit set and
ends its response early. In our traces we saw that this exposes itself
by ROT_IRQ de- asserting before CSn is actually de-asserted and
sprot continues this livelock indefinitely. If after every transaction,
the RoT waits for the CSn gpio line to actually become de-asserted
before we consider the SPI transaction complete, we know that we are
operating on a request or reply boundary, and thus the RoT and SP are
resynchronized.

Our salae traces from #1507 also showed that we get into this scenario
in the first place by having the RoT start reading from its fifo in
the middle of a transaction. We therefore added support for checking
the sot bit on the first read from the fifo to see if this is indeed
the first fifo entry being read after CSn was asserted. If this is
not the case the RoT immediately declares a desynchronization, waits
for CSn to actually be deasserted via gpio read, replies to the
SP with the desynchronization error for visibility and moves on to
waiting for the next request. This strategy was discussed in chat with
hubris team members and causes no harm. However, after implementing
and thinking more about this it seems semantically incorrect. We
already have SprotProtocolError::FlowError which indicates overruns.
A missing sot bit on the first fifo read of a request is actually
just a flow control error, except that instead of the missing bytes
being in the middle of the request, they are at the beginning. In the
common case, this should be detected via the rxerror bit, and we
should return a FlowError. If there is an actual desynchronization,
we will detect that after the request when we poll the CHIP_SELECT
gpio pin. It is totally possible that the RoT misses the first few
bytes of an SP request but is not looking at an ssd bit from a prior
transaction. Informing the SP that this common flow error is a very rare
desynchronization event that is triggered on sled restarts and bumping
counters will lead to misleading debugging paths IMO, and we should
probably remove the code that uses the sot bit before this is merged.

There were some additional changes made for clarity and correctness.
Cleanup of fifos, errors, and ssa/ssd bits is now self-contained, and
asserting and de-asserting ROT_IRQ happen inside reply. I didn't think
it was really necessary to optimize for the sys_irq_control syscall
delay with regards to setting ROT_IRQ given that we have a 16 byte
fifo and then the SP pauses for 2 ms before reading the rest of a reply
larger than 16 bytes. That gives plenty of time for that syscall to
complete and for the RoT to wait for the CSn asserted IRQ wakeup after
asserting ROT_IRQ. This change makes the code more linear and removes
some unnecessary state. Testing so far has shown no difference in error
rate.

On the SP side, Desynchronization errors are now recoverable.

It should also be noted how adding the new Desyncrhonization error
will affect upgrades. It is a new varient to the SprotProtocolError
enum and therfore code that is unaware of this variant will not be
able to deserialize it. Specifically:

  1. Because the RoT is upgraded before the SP in mupdate this means that
    the SP code will see a SprotProtocolError::Deserialization error in
    the case of desynchronization. This is already a recoverable error and
    the behavior of the SP sprot server should be the same, except that if
    retries are exceeded for some reason, the wrong error could be plumbed
    up to the control-plane-agent and MGS. This is exceedingly unlikely for
    this specific error, except for perhaps in the flow control case where
    we use sot described above.

  2. Until MGS is updated, if the new error variant gets plumbed upwards
    it will be seen as an incompatible protocol error. This is not really
    a big deal in this case, as we are still mupdating and this is the
    only related error that can occur this way until the system is further
    upgraded.

@andrewjstone
Copy link
Contributor Author

Before merging, we'll need to merge in oxidecomputer/management-gateway-service#121. However, before we do that, we'll need to deal with oxidecomputer/management-gateway-service#123.

@andrewjstone andrewjstone mentioned this pull request Sep 8, 2023
pub timeouts: u32,

/// Number of times the RoT has reported that it was desynchronized
pub desynchronized: u32,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably increment this somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will require more refactors than I'm willing to make right now on this PR.

@andrewjstone
Copy link
Contributor Author

I should also note that this code needs testing on real hardware and not the gimletlet/rot-carrier. I'll need to update app/oxide-rot-1/app.toml to do this. The image will also have to be signed by non-bartholemew keys I believe.

The primary fix here is for the RoT to detect desynchronization between
the SP and RoT and then wait until synchronization resumes before
proceeding. The RoT and SP are desynchronized if `CSn` is still asserted
*after* the RoT sees the `ssd` bit set in the fifo status register
during its tight loops while clocking in a request or clocking out a
reply to the SP. Synchronization is resumed by the RoT waiting until CSn
is actually de-asserted before either replying to the SP or de-asserting
ROT_IRQ. Detection of actual `CSn` state is performed by polling the
`CHIP_SELECT` gpio pin that corresponds to the `CSn` pin used by the SPI
block on the RoT that sets the `ssa` and `ssd` bits.

The key problem being solved is that the `ssd` bit is saturating, and so
the RoT sprot server may look at this bit and think that the current SPI
transaction is done, even though the bit was set during a prior request.
The actual `CSn` de-assert for the current request arrives after the
RoT reads and clears the `ssd` bit, causing the bit to be set again. The
next request from the SP then comes in, the `ssa` interrupt indicating
`CSn` asserted fires and the RoT goes into a tight loop, processes a
fifo's worth of data and again immediately sees the `ssd` bit set and
ends its response early. In our traces we saw that this exposes itself
by `ROT_IRQ` de- asserting before `CSn` is actually de-asserted and
sprot continues this livelock indefinitely. If after every transaction,
the RoT waits for the `CSn` gpio line to actually become de-asserted
before we consider the SPI transaction complete, we know that we are
operating on a request or reply boundary, and thus the RoT and SP are
resynchronized.

Our salae traces from #1507 also showed that we get into this scenario
in the first place by having the RoT start reading from its fifo in
the middle of a transaction. We therefore added support for checking
the `sot` bit on the first read from the fifo to see if this is indeed
the first fifo entry being read after `CSn` was asserted. If this is
not the case the RoT immediately declares a desynchronization, waits
for `CSn` to actually be  deasserted via gpio read, replies to the
SP with the desynchronization error for visibility and moves on to
waiting for the next request. This strategy was discussed in chat with
hubris team members and causes no harm. However, after implementing
and thinking more about this it seems semantically incorrect. We
already have `SprotProtocolError::FlowError` which indicates overruns.
A missing `sot` bit on the first fifo read of a request is actually
just a flow control error, except that instead of the missing bytes
being in the middle of the request, they are at the beginning. In the
common case, this should be detected via the `rxerror` bit, and we
should return a `FlowError`. If there is an actual desynchronization,
we will detect that after the request when we poll the `CHIP_SELECT`
gpio pin. It is totally possible that the RoT misses the first few
bytes of an SP request but is not looking at an `ssd` bit from a prior
transaction. Informing the SP that this common flow error is a very rare
desynchronization event that is triggered on sled restarts and bumping
counters will lead to misleading debugging paths IMO, and we should
probably remove the code that uses the `sot` bit before this is merged.

There were some additional changes made for clarity and correctness.
Cleanup of fifos, errors, and ssa/ssd bits is now self-contained, and
asserting and de-asserting ROT_IRQ happen inside `reply`. I didn't think
it was really necessary to optimize for the `sys_irq_control` syscall
delay with regards to setting `ROT_IRQ` given that we have a 16 byte
fifo and then the SP pauses for 2 ms before reading the rest of a reply
larger than 16 bytes. That gives plenty of time for that syscall to
complete and for the RoT to wait for the CSn asserted IRQ wakeup after
asserting ROT_IRQ. This change makes the code more linear and removes
some unnecessary state. Testing so far has shown no difference in error
rate.

On the SP side, `Desynchronization` errors are now recoverable.

It should also be noted how adding the new `Desyncrhonization` error
will affect upgrades. It is a new varient to the `SprotProtocolError`
enum and therfore code that is unaware of this variant will not be
able to deserialize it. Specifically:

 1. Because the RoT is upgraded before the SP in mupdate this means that
the SP code will see a `SprotProtocolError::Deserialization` error in
the case of desynchronization. This is already a recoverable error and
the behavior of the SP sprot server should be the same, except that if
retries are exceeded for some reason, the wrong error could be plumbed
up to the control-plane-agent and MGS. This is exceedingly unlikely for
this specific error, except for perhaps in the flow control case where
we use `sot` described above.

 2. Until MGS is updated, if the new error variant gets plumbed upwards
it will be seen as an incompatible protocol error. This is not really
a big deal in this case, as we are still mupdating and this is the
only related error that can occur this way until the system is further
upgraded.
@labbott
Copy link
Collaborator

labbott commented Sep 8, 2023

However, after implementing
and thinking more about this it seems semantically incorrect. We
already have SprotProtocolError::FlowError which indicates overruns.
A missing sot bit on the first fifo read of a request is actually
just a flow control error, except that instead of the missing bytes
being in the middle of the request, they are at the beginning. In the
common case, this should be detected via the rxerror bit, and we
should return a FlowError

I'm not sure about this. Are we sure this is going to be detected via rxerror? If bytes are being dropped I would think that the hardware would still see it as the start of transfer and still set the bit.

Comment on lines +277 to +285
let (read, sot) = self.spi.read_u16_with_sot();
if first_read {
first_read = false;
if !sot {
self.stats.desynchronized =
self.stats.desynchronized.wrapping_add(1);
return Err(IoError::Desynchronized);
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we removed this code wouldn't we still run into the issue where SP has started a transmission SSa gets asserted but we don't notice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it would be detected when trying to reply because CSn would still be asserted while SSD is set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, after thinking more about this, I withdraw most of my concern. The code works as is, and I haven't actually seen this hit during upgrades even though I've seen a handful of rxerrors(which is on par with what our current error rates are).

@andrewjstone
Copy link
Contributor Author

However, after implementing
and thinking more about this it seems semantically incorrect. We
already have SprotProtocolError::FlowError which indicates overruns.
A missing sot bit on the first fifo read of a request is actually
just a flow control error, except that instead of the missing bytes
being in the middle of the request, they are at the beginning. In the
common case, this should be detected via the rxerror bit, and we
should return a FlowError

I'm not sure about this. Are we sure this is going to be detected via rxerror? If bytes are being dropped I would think that the hardware would still see it as the start of transfer and still set the bit.

Interesting. I thought all dropped bytes were detected by rxerror, except perhaps on startup before we initialize spi or because we mistakenly clear the error. I didn't really think about the case that the SOT bit would still be set even if early bytes were dropped. That seems quite problematic for its usage here. The problem we are solving for is missing initial bytes.

@andrewjstone
Copy link
Contributor Author

The repro test script has been running on my gimletlet/rot-carrier for over 3 hours now with no reproductions.

I also just installed a this image on a sled on madrid via faux-mgs and it installed fine and state was accessible.

@andrewjstone
Copy link
Contributor Author

The script I am running to reproduce the issue is:

#!/usr/bin/env bash


MGS_REPO=${MGS_REPO:=~/oxide/management-gateway-service}
HUBRIS_REPO=${HUBRIS_REPO:=~/oxide/hubris}
HUMILITY_ENVIRONMENT=${HUMILITY_ENVIRONMENT:=~/Users/ajs/oxide/humility_env.json}

HENV=$HUMILITY_ENVIRONMENT
rota_reset='humility -t rot-carrier --archive-name a reset'
ROT_RESET_COMMAND=${RESET_COMMAND:="$rota_reset"}

# Number of times in a row `faux-mgs state` must fail to be a test success
MAX_FAILS=${MAX_FAILS:=110}


ready_to_reset=false
failures_in_a_row=0

# We want to wait for a few times through the loop to make sure the reset
# command completes before trying again. Otherwise, we will get an error:
#
#   humility reset failed: Probe could not be created
#
#  Caused by:
#      0: hidapi error: hid_error is not implemented yet
#  	    1: hidapi error: hid_error is not implemented yet
#
MAX_GRACE_CYCLES=20
reset_grace=$MAX_GRACE_CYCLES

while true; do
  # Run our command
  output=`$MGS_REPO/target/release/faux-mgs --interface en6 --log-level critical state`
  echo "$output" | grep -q "RoT state: Ok"
  if [[ $? -eq 0 ]]; then
    failures_in_a_row=0
    echo "."
  else
    failures_in_a_row=$((failures_in_a_row+1))
    echo "+"
  fi

  # increment our grace period counter
  reset_grace=$((reset_grace+1))

  # Are we done?
  if [[ failures_in_a_row -eq MAX_FAILS ]]; then
    echo "Test Complete: found faulty sprot"
    break
  fi

  # Should we reset the RoT to try again?
  choice=$RANDOM
  sleep=$((1/(choice % 100 + 1)))
  if [[ $failures_in_a_row -eq 0 && $reset_grace -gt $MAX_GRACE_CYCLES ]]; then
    # reset our grace period counter
    reset_grace=0
    # Flip a coin to see if we should sleep before we try to reset or after
    # We right shift to create a boolean
    let "choice >>= 14"
    if [[ $choice -eq 0 ]]; then
      HUMILITY_ENV=$HENV bash -c "sleep ${sleep} && ${ROT_RESET_COMMAND} &" &
    else 
      HUMILITY_ENV=$HENV bash -c "${ROT_RESET_COMMAND} &" &
    fi 
  else
    sleep 0.01
  fi
done

first_read = false;
if !sot {
self.stats.desynchronized =
self.stats.desynchronized.wrapping_add(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe saturating_add instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this match all the other counters. I specifically did that because it seems more useful to know if a counter is moving than if it saturated at one point in time. Although, in either case we shouldn't really come close to hitting this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I don't feel strongly about this!

self.stats.desynchronized =
self.stats.desynchronized.wrapping_add(1);
}
result = Err(DesynchronizedError);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe move this into the conditional, to make it more obvious that the conditional is only executed on the first transition from Ok -> Err?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call!

Copy link
Collaborator

@labbott labbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good with the few changes suggested. Given how there's almost no debugging options if this link is down are there any other options for reporting an error or making it possible to get state out? We can file as a tracking item if necessary.

ringbuf_entry!(Trace::Stats(io.stats));
io.reply(&tx_buf[..rsp_len]);
}
}

struct DesynchronizedError;
impl Io {
// Wait for chip select to be asserted
// Assert ROT_IRQ if this is a reply
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this comment if we're not doing this part anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Begone!

}

// Reply to the SP
// Our fifos are guaranteed to be empty at this point
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way we could add an assert/error for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. There's also always an inherent TOCTUO here. We could check the fifos, but they may not be empty right after the check.

If they aren't empty we'll end up with some sort of error here or at the SP. I'll be more detailed in the comment.

@andrewjstone
Copy link
Contributor Author

I think this looks good with the few changes suggested. Given how there's almost no debugging options if this link is down are there any other options for reporting an error or making it possible to get state out? We can file as a tracking item if necessary.

I'm not sure how we would do that. I think we would have to raise the fault pin and then reset the RoT.

@andrewjstone andrewjstone merged commit de395a8 into master Sep 14, 2023
71 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants