-
Notifications
You must be signed in to change notification settings - Fork 181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950
Comments
So, we could certainly make all callers of the IPC interface provide some enum identifying themselves or something. Another potential option, though, would be to just have the sequencer server stick the task ID of the caller of |
I'd vote update the IPC interface. |
Currently, when the Gimlet CPU sequencer changes the system's power state, no information about *why* the power state changed is recorded by the SP firmware. A system may power off or reboot for a variety of reasons: it may be requested by the host OS over IPCC, by the control plane over the management network, or triggered by the thermal task due to an overheat condition. This makes debugging an unexpected reboot or power off difficult, as the SP ringbuffers and other diagnostics do not indicate why an unexpected power state change occurred. See #1950 for a motivating example. This commit resolves this as described in #1950 by adding a new field to the `SetState` variant in the `drv-gimlet-seq-server` ringbuffer, so that the reason a power state change occurred can be recorded. Clients of the `cpu_seq` IPC API must now provide a `StateChangeReason` when calling `Sequencer.set_state`, along with the desired power state, and the sequencer task will record the provided reason in its ringbuffer. This way, we can distinguish between the various reasons a power state change may have occurred when debugging such issues. The `StateChangeReason` enum also generates counters, so that the total number of power state changes can be tracked. Fixes #1950
Today in manufacturing we had a situation where 5 sleds had rebooted from the host's perspective, but the SPs had not reset. We were unable to find any evidence on the host side about what caused the reboots. Particularly curiously, the uptimes on all 5 sleds were almost identical, so it appears all five made A2 -> A0 transitions at roughly the same time.
@rmustacc suggested an easy extension to the sequencer ringbufs that would help in a situation like this: whenever the sequencer gets a request to change the power state, can we require callers to pass in something that describes the source or cause (e.g., request via control plane agent vs request from ipcc vs ...)?
The text was updated successfully, but these errors were encountered: