The primary fix here is for the RoT to detect desynchronization between the SP and RoT and then wait until synchronization resumes before proceeding. The RoT and SP are desynchronized if `CSn` is still asserted *after* the RoT sees the `ssd` bit set in the fifo status register during its tight loops while clocking in a request or clocking out a reply to the SP. Synchronization is resumed by the RoT waiting until `CSn` is actually de-asserted before either replying to the SP or de-asserting `ROT_IRQ`. Detection of actual `CSn` state is performed by polling the `CHIP_SELECT` gpio pin that corresponds to the `CSn` pin used by the SPI block on the RoT that sets the `ssa` and `ssd` bits.

The key problem being solved is that the `ssd` bit is saturating, and so the RoT sprot server may look at this bit and think that the current SPI transaction is done, even though the bit was set during a prior request. The actual `CSn` de-assert for the current request arrives after the RoT reads and clears the `ssd` bit, causing the bit to be set again. The next request from the SP then comes in, the `ssa` interrupt indicating `CSn` asserted fires, and the RoT goes into a tight loop, processes a fifo's worth of data, and again immediately sees the `ssd` bit set and ends its response early. In our traces this shows up as `ROT_IRQ` de-asserting before `CSn` is actually de-asserted, and sprot continues in this livelock indefinitely.

If, after every transaction, the RoT waits for the `CSn` gpio line to actually become de-asserted before it considers the SPI transaction complete, we know that we are operating on a request or reply boundary, and thus the RoT and SP are resynchronized.

Our Saleae traces from #1507 also showed that we get into this scenario in the first place by having the RoT start reading from its fifo in the middle of a transaction.
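
The wait-for-de-assert step described above can be sketched roughly as follows. This is a minimal simulation and not the actual sprot server code: the `SpiHw` trait, the `SimHw` struct, and `finish_transaction` are hypothetical names invented for illustration, and the simulated hardware only models a sticky `ssd` bit plus a `CSn` level read through a gpio poll.

```rust
/// Hypothetical view of the hardware state the RoT consults. These names are
/// illustrative only and are not the real Hubris driver API.
trait SpiHw {
    /// Saturating "CSn was de-asserted" status bit.
    fn ssd(&self) -> bool;
    /// Clear the `ssd` bit.
    fn clear_ssd(&mut self);
    /// Read the CHIP_SELECT gpio pin: true while CSn is asserted.
    fn csn_asserted(&mut self) -> bool;
}

/// Finish a transaction once the tight loop has seen `ssd` set.
/// Returns true if a desynchronization was detected.
fn finish_transaction<H: SpiHw>(hw: &mut H) -> bool {
    // `ssd` set while CSn is still asserted means the bit is stale:
    // it was latched by an earlier de-assert, not by this transaction.
    let desynced = hw.ssd() && hw.csn_asserted();
    // Wait for the real de-assert so we resume on a transaction boundary.
    while hw.csn_asserted() {}
    // The real de-assert latches `ssd` again; clear it so it cannot leak
    // into the next request.
    hw.clear_ssd();
    desynced
}

/// Simulated hardware (assumption for this sketch): CSn stays asserted for a
/// fixed number of gpio polls, then de-asserts and latches `ssd`.
struct SimHw {
    ssd_bit: bool,
    csn_low_polls: u32,
}

impl SpiHw for SimHw {
    fn ssd(&self) -> bool {
        self.ssd_bit
    }
    fn clear_ssd(&mut self) {
        self.ssd_bit = false;
    }
    fn csn_asserted(&mut self) -> bool {
        if self.csn_low_polls > 0 {
            self.csn_low_polls -= 1;
            if self.csn_low_polls == 0 {
                // CSn de-asserts right after this poll, latching `ssd`.
                self.ssd_bit = true;
            }
            true
        } else {
            false
        }
    }
}
```

In this model, a `SimHw` that starts with `ssd_bit` set and `CSn` still asserted is reported as desynchronized, and in both the stale and the clean case the function returns only after `CSn` is de-asserted and `ssd` is cleared.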
To catch a mid-transaction start, we added support for checking the `sot` bit on the first read from the fifo to see if this is indeed the first fifo entry being read after `CSn` was asserted. If this is not the case, the RoT immediately declares a desynchronization, waits for `CSn` to actually be de-asserted via gpio read, replies to the SP with the desynchronization error for visibility, and moves on to waiting for the next request. This strategy was discussed in chat with Hubris team members and causes no harm.

However, after implementing and thinking more about this, it seems semantically incorrect. We already have `SprotProtocolError::FlowError`, which indicates overruns. A missing `sot` bit on the first fifo read of a request is actually just a flow control error, except that instead of the missing bytes being in the middle of the request, they are at the beginning. In the common case, this should be detected via the `rxerror` bit, and we should return a `FlowError`. If there is an actual desynchronization, we will detect that after the request when we poll the `CHIP_SELECT` gpio pin. It is totally possible that the RoT misses the first few bytes of an SP request but is not looking at an `ssd` bit from a prior transaction. Informing the SP that this common flow error is a very rare desynchronization event that is triggered on sled restarts, and bumping counters accordingly, will lead to misleading debugging paths IMO, and we should probably remove the code that uses the `sot` bit before this is merged.

There were some additional changes made for clarity and correctness. Cleanup of fifos, errors, and `ssa`/`ssd` bits is now self-contained, and asserting and de-asserting `ROT_IRQ` happen inside `reply`. I didn't think it was really necessary to optimize for the `sys_irq_control` syscall delay with regard to setting `ROT_IRQ`, given that we have a 16 byte fifo and then the SP pauses for 2 ms before reading the rest of a reply larger than 16 bytes.
That gives plenty of time for that syscall to complete and for the RoT to wait for the `CSn` asserted IRQ wakeup after asserting `ROT_IRQ`. This change makes the code more linear and removes some unnecessary state. Testing so far has shown no difference in error rate.

On the SP side, `Desynchronization` errors are now recoverable.

It should also be noted how adding the new `Desynchronization` error will affect upgrades. It is a new variant of the `SprotProtocolError` enum, and therefore code that is unaware of this variant will not be able to deserialize it. Specifically:

1. Because the RoT is upgraded before the SP in mupdate, the SP code will see a `SprotProtocolError::Deserialization` error in the case of desynchronization. This is already a recoverable error and the behavior of the SP sprot server should be the same, except that if retries are exceeded for some reason, the wrong error could be plumbed up to the control-plane-agent and MGS. This is exceedingly unlikely for this specific error, except perhaps in the flow control case where we use `sot`, described above.
2. Until MGS is updated, if the new error variant gets plumbed upwards it will be seen as an incompatible protocol error. This is not really a big deal in this case, as we are still mupdating and this is the only related error that can occur this way until the system is further upgraded.
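
The upgrade concern in point 1 can be illustrated with a small, self-contained sketch. It does not use the real wire format or the real `SprotProtocolError` definition: the two enums below are hypothetical subsets, and the one-byte tag encoding, with the new variant appended at the end, is an assumption made purely for illustration.

```rust
/// Error set known to not-yet-upgraded code (hypothetical subset).
#[derive(Debug, PartialEq)]
enum OldError {
    FlowError,
    Deserialization,
}

/// Error set after the upgrade, with the new variant appended.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NewError {
    FlowError,
    Deserialization,
    Desynchronization,
}

/// Upgraded sender: encode the error as a one-byte tag.
/// The tag format is an assumption of this sketch.
fn encode_new(e: NewError) -> u8 {
    match e {
        NewError::FlowError => 0,
        NewError::Deserialization => 1,
        NewError::Desynchronization => 2,
    }
}

/// Old receiver: tags it does not know about fail to decode.
fn decode_old(tag: u8) -> Option<OldError> {
    match tag {
        0 => Some(OldError::FlowError),
        1 => Some(OldError::Deserialization),
        _ => None,
    }
}

/// What the old receiver ends up reporting: a decode failure is
/// surfaced as its own `Deserialization` error.
fn old_receiver_view(tag: u8) -> OldError {
    decode_old(tag).unwrap_or(OldError::Deserialization)
}
```

Under these assumptions, a `Desynchronization` error sent by upgraded code is reported by the older receiver as `Deserialization`, while variants that both sides know about pass through unchanged.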