Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gimlet-seq: Record why the power state changed #1953

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Dec 14, 2024

This branch contains two commits, the first of which is the main one, and the
second of which is an opportunistic little refactor that I thought was worth
doing while I was in the neighborhood:

  • gimlet-seq: Record why the power state changed (436f5dd)

    Currently, when the Gimlet CPU sequencer changes the system's power
    state, no information about why the power state changed is recorded by
    the SP firmware. A system may power off or reboot for a variety of
    reasons: it may be requested by the host OS over IPCC, by the control
    plane over the management network, or triggered by the thermal task due
    to an overheat condition. This makes debugging an unexpected reboot or
    power off difficult, as the SP ringbuffers and other diagnostics do not
    indicate why an unexpected power state change occurred. See Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950 for a
    motivating example.

    This commit resolves this as described in Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950 by adding a new field to
    the SetState variant in the drv-gimlet-seq-server ringbuffer, so
    that the reason a power state change occurred can be recorded. Clients
    of the cpu_seq IPC API must now provide a StateChangeReason when
    calling Sequencer.set_state, along with the desired power state, and
    the sequencer task will record the provided reason in its ringbuffer.
    This way, we can distinguish between the various reasons a power state
    change may have occurred when debugging such issues.

    The StateChangeReason enum also generates counters, so that the total
    number of power state changes can be tracked.

    Fixes Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950

Caution

THIS IS A BREAKING CHANGE for any scripts that use humility hiffy
to call into the Sequencer.set_state IPC API. I'm not sure how big a
deal this is, as I know we'd like to be exposing separate, stable interfaces
for stuff like this eventually. However, if we don't want to break any scripts
that currently exist, we could also go down a path of having separate
Sequencer.set_state and Sequencer.set_state_with_reason IPCs, and change
all the Hubris tasks to use the latter, but allow Hiffy callers to still use the
former without having to add a new argument. Callers of
Sequencer.set_state would then be recorded in the ringbuf with some
StateChangeReason::Other or StateChangeReason::RandomHiffyPoke
kinda variant.

This feels sad to me, but I'm willing to go down this route if necessary.

  • gimlet-seq: Name SetState ringbuf entry fields (7b43a72)

    Currently, the Trace::SetState ringbuf entry in the sequencer is a
    tuple-like enum variant. This entry includes two PowerState fields,
    one recording the previous power state and the other recording the new
    power state that has been set. IMHO, using a tuple-like variant to
    represent this is a bit unfortunate, as in Humility, we'll see two
    values of the same type and it's not immediately obvious which is the
    previous state and which is the new state. This must be determined based
    on the order of the fields in the ringbuf entry, which requires
    referencing the Hubris code to determine.

    I felt like it was nicer to just use a struct-like variant with named
    fields for this. That way, the semantic meaning of the two PowerStates
    is actually encoded in the debug info, and Humility can just indicate
    which is the previous state and which is the new state when displaying
    the ring buffer. I also think it's a bit nicer to name the timestamp
    field --- otherwise, it just looks like some arbitrary integer, and you
    need to look at the code to determine that it's the timestamp of the
    power state change.

    If this is controversial for some reason, I'm happy to land it in a
    separate PR, but I figured it was nice to do while I was messing with
    the sequencer ringbuf.

Currently, when the Gimlet CPU sequencer changes the system's power
state, no information about *why* the power state changed is recorded by
the SP firmware. A system may power off or reboot for a variety of
reasons: it may be requested by the host OS over IPCC, by the control
plane over the management network, or triggered by the thermal task due
to an overheat condition. This makes debugging an unexpected reboot or
power off difficult, as the SP ringbuffers and other diagnostics do not
indicate why an unexpected power state change occurred. See #1950 for a
motivating example.

This commit resolves this as described in #1950 by adding a new field to
the `SetState` variant in the `drv-gimlet-seq-server` ringbuffer, so
that the reason a power state change occurred can be recorded. Clients
of the `cpu_seq` IPC API must now provide a `StateChangeReason` when
calling `Sequencer.set_state`, along with the desired power state, and
the sequencer task will record the provided reason in its ringbuffer.
This way, we can distinguish between the various reasons a power state
change may have occurred when debugging such issues.

The `StateChangeReason` enum also generates counters, so that the total
number of power state changes can be tracked.

Fixes #1950
Currently, the `Trace::SetState` ringbuf entry in the sequencer is a
tuple-like enum variant. This entry includes two `PowerState` fields,
one recording the previous power state and the other recording the new
power state that has been set. IMHO, using a tuple-like variant to
represent this is a bit unfortunate, as in Humility, we'll see two
values of the same type and it's not immediately obvious which is the
previous state and which is the new state. This must be determined based
on the order of the fields in the ringbuf entry, which requires
referencing the Hubris code to determine.

I felt like it was nicer to just use a struct-like variant with named
fields for this. That way, the semantic meaning of the two `PowerState`s
is actually encoded in the debug info, and Humility can just indicate
which is the previous state and which is the new state when displaying
the ring buffer. I also think it's a bit nicer to name the timestamp
field --- otherwise, it just looks like some arbitrary integer, and you
need to look at the code to determine that it's the timestamp of the
power state change.

If this is controversial for some reason, I'm happy to land it in a
separate PR, but I figured it was nice to do while I was messing with
the sequencer ringbuf.
@hawkw hawkw requested review from cbiffle and jgallagher December 14, 2024 19:57
@hawkw
Copy link
Member Author

hawkw commented Dec 14, 2024

Comment on lines 37 to 49
pub enum StateChangeReason {
/// TThe system has just received power, so the sequencer has booted the
/// host CPU.
InitialPowerOn = 1,
/// A power state change was requested by the control plane.
ControlPlane,
/// The host OS requested that the system power off without rebooting.
HostPowerOff,
/// The host OS requested that the system reboot.
HostReboot,
/// The system powered off because a component has overheated.
Overheat,
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very open to suggestions about the naming of these variants, if anyone dislikes the ones I came up with...

Comment on lines +44 to +46
HostPowerOff,
/// The host OS requested that the system reboot.
HostReboot,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also considered making these one thing, like

HostPowerOff { reboot: bool }

but that makes this enum two bytes, and probably also embiggens the ringbuf entry...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might get the desired effect with another enum; HostPowerOff(PowerOffStyle)

enum PowerOffStyle {
  Shutdown,
  Reboot
}

As I understand, as long as there aren't multiple enum variants with a value, and the only value is a non-repr enum then Rust will smash the two together. At least, that's what I've seen happening in one place in code I work with.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I didn't realize Rust would do that when the outermost enum has a repr attribute. That's kinda nice! I'll look into it, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, though, I'm not sure if that's really any nicer than the current thing, in this case --- the only reason I would've done it with a bool is so that the caller that uses this variant, the power_off_host function in task-host-sp-comms, could just pass its reboot field rather than having to say if reboot { ... }:

fn power_off_host(&mut self, reboot: bool) {
let reason = if reboot {
drv_cpu_seq_api::StateChangeReason::HostReboot
} else {
drv_cpu_seq_api::StateChangeReason::HostPowerOff
};
loop {
// Attempt to move to A2; given we only call this function in
// response to a host request, we expect we're currently in A0 and
// this should work.
let err = match self.sequencer.set_state(PowerState::A2, reason) {

But, if we add another enum here, it still has to do that, so. I'm not sure if this is worth changing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was really confused and surprised when I found out as well, but it's pretty nice at times.

You could implement a From<bool> for the inner enum to hide the if-else away. That being said, it's arguably not that much nicer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that a From<bool> implementation would only be used in one place (power_off_host in task-host-sp-comms) but would have to be defined in cpu-seq-api, I think I'll leave it as-is. I think it's easier to understand when the conversion from the reboot: bool to a StateChangeReason occurs in the one place where it actually happens, instead of hiding it behind a From impl in a separate crate that isn't used anywhere else.

Thanks for the suggestions, though!

@@ -58,6 +58,7 @@ impl idl::InOrderSequencerImpl for ServerImpl {
&mut self,
_: &RecvMessage,
state: PowerState,
_: StateChangeReason,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should add a ringbuffer to the mock sequencer implementation on Gimletlet, to aid in debugging in its callers?

/// The host OS requested that the system power off without rebooting.
HostPowerOff,
/// The host OS requested that the system reboot.
HostReboot,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I realize it's a bit tricky, I think it'd be useful to distinguish between a reboot due to an IPCC panic and a reboot. Unfortunately the panic command doesn't itself indicate a reboot, but a subsequent one from the host with no additional action taken will indicate a panic. It may not be relevant at this particular point, but it depends a bit on where we see us generating an alert that the system has rebooted from. That is, it's not clear to me that we'd want to do that from all of the different callers versus use the sequencer as a generator.

I also somewhat expected to see a distinction between overheat and MAPO related actions, but the latter may be because if the FPGA sequencer triggers a MAPO event we are seeing that in a different way.

If someone uses humility to manually turn on/off the host power how is that noted through here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I realize it's a bit tricky, I think it'd be useful to distinguish between a reboot due to an IPCC panic and a reboot. Unfortunately the panic command doesn't itself indicate a reboot, but a subsequent one from the host with no additional action taken will indicate a panic.

Definitely agree that it would be good to differentiate between these, I'll see about having the host-sp-comms task track a "was there just a panic" bit so we can thread that through here.

I also somewhat expected to see a distinction between overheat and MAPO related actions, but the latter may be because if the FPGA sequencer triggers a MAPO event we are seeing that in a different way.

As I understand it, MAPO events won't go through the Sequencer.set_state IPC interface, and won't record the SetState ringbuf entry. Instead, I think Hubris' understanding of MAPO is entirely internal to the sequencer task --- I think the relevant ringbuf entries are the ones we record in a0_failure here. @bcantrill or @cbiffle would know better, but that's my understanding, at least.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions
3 participants