Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read SP task crash dumps over the network #1921

Merged
merged 16 commits into from
Dec 10, 2024
Merged

Conversation

mkeeter
Copy link
Collaborator

@mkeeter mkeeter commented Nov 18, 2024

Hubris counterpart to oxidecomputer/management-gateway-service#316, allowing us to read task dumps over the network via MGS.

These MGS messages – unlike dump-agent – require more intelligence into the Hubris side. The external API is just "start read" and "continue read"; Hubris is responsible for knowing how humpty writes stuff into RAM.

We allow for up to 4 simultaneous readers, distinguished by caller-provided keys. If there are more than four simultaneous readers, then (1) what the heck, and (2) some readers will return an error indicating that their key is no longer valid.

There are also minor tweaks to dump-agent and jefe implementations:

  • dump-agent learns how to read dump data into a lease (instead of returning a 256-byte value)
  • jefe caches the most recent dump area → memory address lookup, so we don't always need to start from the beginning of the linked list. This isn't useful if we have multiple readers, but makes the single-reader case less Accidentally Quadratic.

Copy link
Collaborator

@labbott labbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little bit iffy about failure modes for more than 4 clients. It seems like the only result should be bad data on the client end? A sufficiently malicious client would potentially only cause a DoS?

app/gimlet/base.toml Show resolved Hide resolved
@@ -30,6 +30,7 @@ enum Trace {
GetDumpArea(u8),
Base(u32),
GetDumpAreaFailed(humpty::DumpError<()>),
GetDumpAreaFailed2(Result<DumpArea, DumpAgentError>),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a different name? DumpAreaAgentFailure?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this was for debugging; removed

@mkeeter mkeeter force-pushed the mkeeter/mgs-task-dumps branch from 6735862 to f93d5eb Compare November 18, 2024 19:24
offset: u32,
}

const MAX_DUMP_CLIENTS: usize = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 4? is this to allow two MGSes, plus...two more MGSes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly vibes, but I was thinking something like "max of two MGSes plus two tech port connections".

Comment on lines 44 to 79
for index in 0.. {
// XXX accidentally quadratic!
let data = self
.agent
.read_dump(index, 0)
.map_err(|_e| SpError::Dump(DumpError::BadArea))?;
let header = humpty::DumpAreaHeader::read_from(
&data[..size_of::<humpty::DumpAreaHeader>()],
)
.unwrap_lite();
if header.contents == humpty::DumpContents::SingleTask.into()
&& header.nsegments > 0
{
count += 1;
}
if header.next == 0 {
break;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether it would make sense to have the dump agent just have a counter of the number of single-task dumps, so that we could just ask it for that, instead of probing every dump area? Looking at the dump-agent implementation, that might be somewhat difficult, as reinitialize_dump_from would have to read the reinitialized memory in order to reason about how many single-task dumps are being discarded...maybe this is the better approach?

}

// Move along to the next area if we're at the end
if state.offset + size_of::<humpty::DumpSegmentData>() as u32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considered obnoxious, feel free to ignore: size_of::<humpty::DumpSegmentData>() and size_of::<humpty::DumpAreaHeader>() are a little bit hard to tell apart in this function, because they're both long-ish expressions and the type parameter to size_of starts with humpty::Dump and my eyes glazed over at that point. I might consider doing something like

const HEADER_SIZE: u32 = size_of::<humpy::DumpAreaHeader>() as u32;
const SEGMENT_DATA_SIZE: u32 = size_of::<humpy::DumpSegmentData>() as u32;

and then referring to them by those constants elsewhere, so that it's more obvious that they're different things? but maybe this is just a "me thing", in which case you're welcome to ignore me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done in 1ddd388

#[cfg(feature = "dump")]
dump_areas: u32,

/// Cache of most recently checked dump area
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be nice to mention what this is used for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, done

task/jefe/src/main.rs Outdated Show resolved Hide resolved
if index == prev.index {
// Easy case: we've already looked up this area
Ok(prev)
} else if let Some(offset) = index.checked_sub(prev.index) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was going to suggest that this expression might be better represented as a match, since the two default cases could be combined, but...i think you can't write a match guard like

Some(prev) if let Some(offset) = index.checked_sub(prev.index) =>

which is a bit of a bummer. I suppose you could instead write

Some(prev) if prev.index >= index => 

and then do the subtraction in the body of the match arm, but using checked_sub both as the check and the actual subtraction feels much nicer, and the case where the checked_sub returns None would still have to be handled despite never actually happening, which feels quite sad...

@mkeeter
Copy link
Collaborator Author

mkeeter commented Nov 20, 2024

I'm a little bit iffy about failure modes for more than 4 clients. It seems like the only result should be bad data on the client end? A sufficiently malicious client would potentially only cause a DoS?

Yeah, that's correct – if I open 5x terminals and run

faux-mgs --log-level=debug --interface en0 dump read --index 0

in all of them, the first one stops with Error: Error response from SP: dump: invalid key

@mkeeter mkeeter force-pushed the mkeeter/mgs-task-dumps branch 2 times, most recently from 7a7119f to 4c7a97f Compare November 20, 2024 16:30
@mkeeter mkeeter force-pushed the mkeeter/mgs-task-dumps branch from 4c7a97f to c9421c6 Compare December 6, 2024 21:22
Copy link
Collaborator

@bcantrill bcantrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, modulo very small comments. Will push the fix to humpty now to allow you to use the proper constant.

Cargo.toml Outdated Show resolved Hide resolved
task/control-plane-agent/src/dump.rs Outdated Show resolved Hide resolved
task/control-plane-agent/src/dump.rs Outdated Show resolved Hide resolved
task/control-plane-agent/src/dump.rs Outdated Show resolved Hide resolved
task/control-plane-agent/src/dump.rs Outdated Show resolved Hide resolved
@mkeeter mkeeter force-pushed the mkeeter/mgs-task-dumps branch from 5d4b929 to d32a1b4 Compare December 10, 2024 19:56
@mkeeter mkeeter force-pushed the mkeeter/mgs-task-dumps branch from d32a1b4 to a7ef633 Compare December 10, 2024 20:07
@mkeeter mkeeter merged commit 1757b8b into master Dec 10, 2024
125 checks passed
@mkeeter mkeeter deleted the mkeeter/mgs-task-dumps branch December 10, 2024 20:39
mkeeter added a commit to oxidecomputer/humility that referenced this pull request Dec 10, 2024
See `oxidecomputer/management-gateway-service#316` and
`oxidecomputer/hubris#1921` for details

This PR adds a new `humility hydrate` subcommand, which takes a "dry" task dump
and staples it to a Hubris archive, creating a normal Hubris core file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants