Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950

jgallagher · 2024-12-13T17:52:57Z

Today in manufacturing we had a situation where 5 sleds had rebooted from the host's perspective, but the SPs had not reset. We were unable to find any evidence on the host side about what caused the reboots. Particularly curiously, the uptimes on all 5 sleds were almost identical, so it appears all five made A2 -> A0 transitions at roughly the same time.

@rmustacc suggested an easy extension to the sequencer ringbufs that would help in a situation like this: whenever the sequencer gets a request to change the power state, can we require callers to pass in something that describes the source or cause (e.g., request via control plane agent vs request from ipcc vs ...)?

hawkw · 2024-12-13T18:17:58Z

So, we could certainly make all callers of the IPC interface provide some enum identifying themselves or something. Another potential option, though, would be to just have the sequencer server stick the task ID of the caller of Sequencer.set_state in the ringbuf. That would be a less intrusive change as it doesn't change the IPC interface, and it also avoids adding another byte to the size of the IPC message (which may not be all that important). But, the ringbuf entries might look a little less nice, as it's just an integer task ID instead of an enum with a name that Humility can interpret...

jgallagher · 2024-12-13T18:40:24Z

I'd vote update the IPC interface. host-sp-comms, at least, can ask for power state changes for a few different reasons, and it'd be good to capture those.

Currently, when the Gimlet CPU sequencer changes the system's power state, no information about *why* the power state changed is recorded by the SP firmware. A system may power off or reboot for a variety of reasons: it may be requested by the host OS over IPCC, by the control plane over the management network, or triggered by the thermal task due to an overheat condition. This makes debugging an unexpected reboot or power off difficult, as the SP ringbuffers and other diagnostics do not indicate why an unexpected power state change occurred. See #1950 for a motivating example. This commit resolves this as described in #1950 by adding a new field to the `SetState` variant in the `drv-gimlet-seq-server` ringbuffer, so that the reason a power state change occurred can be recorded. Clients of the `cpu_seq` IPC API must now provide a `StateChangeReason` when calling `Sequencer.set_state`, along with the desired power state, and the sequencer task will record the provided reason in its ringbuffer. This way, we can distinguish between the various reasons a power state change may have occurred when debugging such issues. The `StateChangeReason` enum also generates counters, so that the total number of power state changes can be tracked. Fixes #1950

hawkw self-assigned this Dec 13, 2024

hawkw linked a pull request Dec 14, 2024 that will close this issue

gimlet-seq: Record why the power state changed #1953

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950

Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950

jgallagher commented Dec 13, 2024

hawkw commented Dec 13, 2024

jgallagher commented Dec 13, 2024

Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950

Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950

Comments

jgallagher commented Dec 13, 2024

hawkw commented Dec 13, 2024

jgallagher commented Dec 13, 2024