Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950

Open
jgallagher opened this issue Dec 13, 2024 · 2 comments · May be fixed by #1953
Open

Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950

jgallagher opened this issue Dec 13, 2024 · 2 comments · May be fixed by #1953
Assignees

Comments

@jgallagher
Copy link
Contributor

Today in manufacturing we had a situation where 5 sleds had rebooted from the host's perspective, but the SPs had not reset. We were unable to find any evidence on the host side about what caused the reboots. Particularly curiously, the uptimes on all 5 sleds were almost identical, so it appears all five made A2 -> A0 transitions at roughly the same time.

@rmustacc suggested an easy extension to the sequencer ringbufs that would help in a situation like this: whenever the sequencer gets a request to change the power state, can we require callers to pass in something that describes the source or cause (e.g., request via control plane agent vs request from ipcc vs ...)?

@hawkw hawkw self-assigned this Dec 13, 2024
@hawkw
Copy link
Member

hawkw commented Dec 13, 2024

So, we could certainly make all callers of the IPC interface provide some enum identifying themselves or something. Another potential option, though, would be to just have the sequencer server stick the task ID of the caller of Sequencer.set_state in the ringbuf. That would be a less intrusive change as it doesn't change the IPC interface, and it also avoids adding another byte to the size of the IPC message (which may not be all that important). But, the ringbuf entries might look a little less nice, as it's just an integer task ID instead of an enum with a name that Humility can interpret...

@jgallagher
Copy link
Contributor Author

I'd vote update the IPC interface. host-sp-comms, at least, can ask for power state changes for a few different reasons, and it'd be good to capture those.

hawkw added a commit that referenced this issue Dec 14, 2024
Currently, when the Gimlet CPU sequencer changes the system's power
state, no information about *why* the power state changed is recorded by
the SP firmware. A system may power off or reboot for a variety of
reasons: it may be requested by the host OS over IPCC, by the control
plane over the management network, or triggered by the thermal task due
to an overheat condition. This makes debugging an unexpected reboot or
power off difficult, as the SP ringbuffers and other diagnostics do not
indicate why an unexpected power state change occurred. See #1950 for a
motivating example.

This commit resolves this as described in #1950 by adding a new field to
the `SetState` variant in the `drv-gimlet-seq-server` ringbuffer, so
that the reason a power state change occurred can be recorded. Clients
of the `cpu_seq` IPC API must now provide a `StateChangeReason` when
calling `Sequencer.set_state`, along with the desired power state, and
the sequencer task will record the provided reason in its ringbuffer.
This way, we can distinguish between the various reasons a power state
change may have occurred when debugging such issues.

The `StateChangeReason` enum also generates counters, so that the total
number of power state changes can be tracked.

Fixes #1950
@hawkw hawkw linked a pull request Dec 14, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants