-
Notifications
You must be signed in to change notification settings - Fork 181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gimlet-seq: Record why the power state changed #1953
base: master
Are you sure you want to change the base?
Conversation
Currently, when the Gimlet CPU sequencer changes the system's power state, no information about *why* the power state changed is recorded by the SP firmware. A system may power off or reboot for a variety of reasons: it may be requested by the host OS over IPCC, by the control plane over the management network, or triggered by the thermal task due to an overheat condition. This makes debugging an unexpected reboot or power off difficult, as the SP ringbuffers and other diagnostics do not indicate why an unexpected power state change occurred. See #1950 for a motivating example. This commit resolves this as described in #1950 by adding a new field to the `SetState` variant in the `drv-gimlet-seq-server` ringbuffer, so that the reason a power state change occurred can be recorded. Clients of the `cpu_seq` IPC API must now provide a `StateChangeReason` when calling `Sequencer.set_state`, along with the desired power state, and the sequencer task will record the provided reason in its ringbuffer. This way, we can distinguish between the various reasons a power state change may have occurred when debugging such issues. The `StateChangeReason` enum also generates counters, so that the total number of power state changes can be tracked. Fixes #1950
Currently, the `Trace::SetState` ringbuf entry in the sequencer is a tuple-like enum variant. This entry includes two `PowerState` fields, one recording the previous power state and the other recording the new power state that has been set. IMHO, using a tuple-like variant to represent this is a bit unfortunate, as in Humility, we'll see two values of the same type and it's not immediately obvious which is the previous state and which is the new state. This must be determined based on the order of the fields in the ringbuf entry, which requires referencing the Hubris code to determine. I felt like it was nicer to just use a struct-like variant with named fields for this. That way, the semantic meaning of the two `PowerState`s is actually encoded in the debug info, and Humility can just indicate which is the previous state and which is the new state when displaying the ring buffer. I also think it's a bit nicer to name the timestamp field --- otherwise, it just looks like some arbitrary integer, and you need to look at the code to determine that it's the timestamp of the power state change. If this is controversial for some reason, I'm happy to land it in a separate PR, but I figured it was nice to do while I was messing with the sequencer ringbuf.
drv/cpu-seq-api/src/lib.rs
Outdated
pub enum StateChangeReason { | ||
/// TThe system has just received power, so the sequencer has booted the | ||
/// host CPU. | ||
InitialPowerOn = 1, | ||
/// A power state change was requested by the control plane. | ||
ControlPlane, | ||
/// The host OS requested that the system power off without rebooting. | ||
HostPowerOff, | ||
/// The host OS requested that the system reboot. | ||
HostReboot, | ||
/// The system powered off because a component has overheated. | ||
Overheat, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very open to suggestions about the naming of these variants, if anyone dislikes the ones I came up with...
HostPowerOff, | ||
/// The host OS requested that the system reboot. | ||
HostReboot, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also considered making these one thing, like
HostPowerOff { reboot: bool }
but that makes this enum two bytes, and probably also embiggens the ringbuf entry...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might get the desired effect with another enum; HostPowerOff(PowerOffStyle)
enum PowerOffStyle {
Shutdown,
Reboot
}
As I understand, as long as there aren't multiple enum variants with a value, and the only value is a non-repr enum then Rust will smash the two together. At least, that's what I've seen happening in one place in code I work with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, I didn't realize Rust would do that when the outermost enum has a repr
attribute. That's kinda nice! I'll look into it, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, though, I'm not sure if that's really any nicer than the current thing, in this case --- the only reason I would've done it with a bool
is so that the caller that uses this variant, the power_off_host
function in task-host-sp-comms
, could just pass its reboot
field rather than having to say if reboot { ... }
:
hubris/task/host-sp-comms/src/main.rs
Lines 376 to 386 in 63995e0
fn power_off_host(&mut self, reboot: bool) { | |
let reason = if reboot { | |
drv_cpu_seq_api::StateChangeReason::HostReboot | |
} else { | |
drv_cpu_seq_api::StateChangeReason::HostPowerOff | |
}; | |
loop { | |
// Attempt to move to A2; given we only call this function in | |
// response to a host request, we expect we're currently in A0 and | |
// this should work. | |
let err = match self.sequencer.set_state(PowerState::A2, reason) { |
But, if we add another enum here, it still has to do that, so. I'm not sure if this is worth changing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was really confused and surprised when I found out as well, but it's pretty nice at times.
You could implement a From<bool>
for the inner enum to hide the if-else away. That being said, it's arguably not that much nicer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that a From<bool>
implementation would only be used in one place (power_off_host
in task-host-sp-comms
) but would have to be defined in cpu-seq-api
, I think I'll leave it as-is. I think it's easier to understand when the conversion from the reboot: bool
to a StateChangeReason
occurs in the one place where it actually happens, instead of hiding it behind a From
impl in a separate crate that isn't used anywhere else.
Thanks for the suggestions, though!
@@ -58,6 +58,7 @@ impl idl::InOrderSequencerImpl for ServerImpl { | |||
&mut self, | |||
_: &RecvMessage, | |||
state: PowerState, | |||
_: StateChangeReason, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should add a ringbuffer to the mock sequencer implementation on Gimletlet, to aid in debugging in its callers?
thanks @aapoalas! Co-authored-by: Aapo Alasuutari <[email protected]>
/// The host OS requested that the system power off without rebooting. | ||
HostPowerOff, | ||
/// The host OS requested that the system reboot. | ||
HostReboot, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I realize it's a bit tricky, I think it'd be useful to distinguish between a reboot due to an IPCC panic and a reboot. Unfortunately the panic command doesn't itself indicate a reboot, but a subsequent one from the host with no additional action taken will indicate a panic. It may not be relevant at this particular point, but it depends a bit on where we see us generating an alert that the system has rebooted from. That is, it's not clear to me that we'd want to do that from all of the different callers versus use the sequencer as a generator.
I also somewhat expected to see a distinction between overheat and MAPO related actions, but the latter may be because if the FPGA sequencer triggers a MAPO event we are seeing that in a different way.
If someone uses humility to manually turn on/off the host power how is that noted through here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I realize it's a bit tricky, I think it'd be useful to distinguish between a reboot due to an IPCC panic and a reboot. Unfortunately the panic command doesn't itself indicate a reboot, but a subsequent one from the host with no additional action taken will indicate a panic.
Definitely agree that it would be good to differentiate between these, I'll see about having the host-sp-comms task track a "was there just a panic" bit so we can thread that through here.
I also somewhat expected to see a distinction between overheat and MAPO related actions, but the latter may be because if the FPGA sequencer triggers a MAPO event we are seeing that in a different way.
As I understand it, MAPO events won't go through the Sequencer.set_state
IPC interface, and won't record the SetState
ringbuf entry. Instead, I think Hubris' understanding of MAPO is entirely internal to the sequencer task --- I think the relevant ringbuf entries are the ones we record in a0_failure
here. @bcantrill or @cbiffle would know better, but that's my understanding, at least.
This branch contains two commits, the first of which is the main one, and the
second of which is an opportunistic little refactor that I thought was worth
doing while I was in the neighborhood:
gimlet-seq: Record why the power state changed (436f5dd)
Currently, when the Gimlet CPU sequencer changes the system's power
state, no information about why the power state changed is recorded by
the SP firmware. A system may power off or reboot for a variety of
reasons: it may be requested by the host OS over IPCC, by the control
plane over the management network, or triggered by the thermal task due
to an overheat condition. This makes debugging an unexpected reboot or
power off difficult, as the SP ringbuffers and other diagnostics do not
indicate why an unexpected power state change occurred. See Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950 for a
motivating example.
This commit resolves this as described in Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950 by adding a new field to
the
SetState
variant in thedrv-gimlet-seq-server
ringbuffer, sothat the reason a power state change occurred can be recorded. Clients
of the
cpu_seq
IPC API must now provide aStateChangeReason
whencalling
Sequencer.set_state
, along with the desired power state, andthe sequencer task will record the provided reason in its ringbuffer.
This way, we can distinguish between the various reasons a power state
change may have occurred when debugging such issues.
The
StateChangeReason
enum also generates counters, so that the totalnumber of power state changes can be tracked.
Fixes Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions #1950
Caution
THIS IS A BREAKING CHANGE for any scripts that use
humility hiffy
to call into the
Sequencer.set_state
IPC API. I'm not sure how big adeal this is, as I know we'd like to be exposing separate, stable interfaces
for stuff like this eventually. However, if we don't want to break any scripts
that currently exist, we could also go down a path of having separate
Sequencer.set_state
andSequencer.set_state_with_reason
IPCs, and changeall the Hubris tasks to use the latter, but allow Hiffy callers to still use the
former without having to add a new argument. Callers of
Sequencer.set_state
would then be recorded in the ringbuf with someStateChangeReason::Other
orStateChangeReason::RandomHiffyPoke
kinda variant.
This feels sad to me, but I'm willing to go down this route if necessary.
gimlet-seq: Name
SetState
ringbuf entry fields (7b43a72)Currently, the
Trace::SetState
ringbuf entry in the sequencer is atuple-like enum variant. This entry includes two
PowerState
fields,one recording the previous power state and the other recording the new
power state that has been set. IMHO, using a tuple-like variant to
represent this is a bit unfortunate, as in Humility, we'll see two
values of the same type and it's not immediately obvious which is the
previous state and which is the new state. This must be determined based
on the order of the fields in the ringbuf entry, which requires
referencing the Hubris code to determine.
I felt like it was nicer to just use a struct-like variant with named
fields for this. That way, the semantic meaning of the two
PowerState
sis actually encoded in the debug info, and Humility can just indicate
which is the previous state and which is the new state when displaying
the ring buffer. I also think it's a bit nicer to name the timestamp
field --- otherwise, it just looks like some arbitrary integer, and you
need to look at the code to determine that it's the timestamp of the
power state change.
If this is controversial for some reason, I'm happy to land it in a
separate PR, but I figured it was nice to do while I was messing with
the sequencer ringbuf.