-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read SP task crash dumps over the network #1921
Conversation
16702f0
to
7ff43aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little bit iffy about failure modes for more than 4 clients. It seems like the only result should be bad data on the client end? A sufficiently malicious client would potentially only cause a DoS?
task/jefe/src/dump.rs
Outdated
@@ -30,6 +30,7 @@ enum Trace { | |||
GetDumpArea(u8), | |||
Base(u32), | |||
GetDumpAreaFailed(humpty::DumpError<()>), | |||
GetDumpAreaFailed2(Result<DumpArea, DumpAgentError>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a different name? DumpAreaAgentFailure
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this was for debugging; removed
6735862
to
f93d5eb
Compare
offset: u32, | ||
} | ||
|
||
const MAX_DUMP_CLIENTS: usize = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why 4? is this to allow two MGSes, plus...two more MGSes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly vibes, but I was thinking something like "max of two MGSes plus two tech port connections".
task/control-plane-agent/src/dump.rs
Outdated
for index in 0.. { | ||
// XXX accidentally quadratic! | ||
let data = self | ||
.agent | ||
.read_dump(index, 0) | ||
.map_err(|_e| SpError::Dump(DumpError::BadArea))?; | ||
let header = humpty::DumpAreaHeader::read_from( | ||
&data[..size_of::<humpty::DumpAreaHeader>()], | ||
) | ||
.unwrap_lite(); | ||
if header.contents == humpty::DumpContents::SingleTask.into() | ||
&& header.nsegments > 0 | ||
{ | ||
count += 1; | ||
} | ||
if header.next == 0 { | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether it would make sense to have the dump agent just have a counter of the number of single-task dumps, so that we could just ask it for that, instead of probing every dump area? Looking at the dump-agent implementation, that might be somewhat difficult, as reinitialize_dump_from
would have to read the reinitialized memory in order to reason about how many single-task dumps are being discarded...maybe this is the better approach?
task/control-plane-agent/src/dump.rs
Outdated
} | ||
|
||
// Move along to the next area if we're at the end | ||
if state.offset + size_of::<humpty::DumpSegmentData>() as u32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
considered obnoxious, feel free to ignore: size_of::<humpty::DumpSegmentData>()
and size_of::<humpty::DumpAreaHeader>()
are a little bit hard to tell apart in this function, because they're both long-ish expressions and the type parameter to size_of
starts with humpty::Dump
and my eyes glazed over at that point. I might consider doing something like
const HEADER_SIZE: u32 = size_of::<humpy::DumpAreaHeader>() as u32;
const SEGMENT_DATA_SIZE: u32 = size_of::<humpy::DumpSegmentData>() as u32;
and then referring to them by those constants elsewhere, so that it's more obvious that they're different things? but maybe this is just a "me thing", in which case you're welcome to ignore me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done in 1ddd388
#[cfg(feature = "dump")] | ||
dump_areas: u32, | ||
|
||
/// Cache of most recently checked dump area |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be nice to mention what this is used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, done
if index == prev.index { | ||
// Easy case: we've already looked up this area | ||
Ok(prev) | ||
} else if let Some(offset) = index.checked_sub(prev.index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was going to suggest that this expression might be better represented as a match
, since the two default cases could be combined, but...i think you can't write a match guard like
Some(prev) if let Some(offset) = index.checked_sub(prev.index) =>
which is a bit of a bummer. I suppose you could instead write
Some(prev) if prev.index >= index =>
and then do the subtraction in the body of the match arm, but using checked_sub
both as the check and the actual subtraction feels much nicer, and the case where the checked_sub
returns None
would still have to be handled despite never actually happening, which feels quite sad...
Yeah, that's correct – if I open 5x terminals and run
in all of them, the first one stops with |
7a7119f
to
4c7a97f
Compare
4c7a97f
to
c9421c6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, modulo very small comments. Will push the fix to humpty
now to allow you to use the proper constant.
5d4b929
to
d32a1b4
Compare
d32a1b4
to
a7ef633
Compare
See `oxidecomputer/management-gateway-service#316` and `oxidecomputer/hubris#1921` for details This PR adds a new `humility hydrate` subcommand, which takes a "dry" task dump and staples it to a Hubris archive, creating a normal Hubris core file.
Hubris counterpart to oxidecomputer/management-gateway-service#316, allowing us to read task dumps over the network via MGS.
These MGS messages – unlike
dump-agent
– require more intelligence into the Hubris side. The external API is just "start read" and "continue read"; Hubris is responsible for knowing howhumpty
writes stuff into RAM.We allow for up to 4 simultaneous readers, distinguished by caller-provided keys. If there are more than four simultaneous readers, then (1) what the heck, and (2) some readers will return an error indicating that their key is no longer valid.
There are also minor tweaks to
dump-agent
andjefe
implementations:dump-agent
learns how to read dump data into a lease (instead of returning a 256-byte value)jefe
caches the most recent dump area → memory address lookup, so we don't always need to start from the beginning of the linked list. This isn't useful if we have multiple readers, but makes the single-reader case less Accidentally Quadratic.