Syscall / context switch performance improvements #1949
base: master
Conversation
@@ -82,6 +84,17 @@ pub trait MemoryRegion {
    fn contains(&self, addr: usize) -> bool;
    fn base_addr(&self) -> usize;
    fn end_addr(&self) -> usize;

    #[inline(always)]
    fn compare(&self, addr: usize) -> Ordering {
part of me wonders whether, instead of having it as a method with a default implementation, we should pull this out into a separate pub fn compare_mem_region<T: MemoryRegion>(region: &T, addr: usize) -> Ordering, because the default implementation can be overridden by an implementation of MemoryRegion and thus end up with incorrect behavior. but, in practice... we can just not do that, and having it be a method is nicer, i suppose... 🤷♀️
Peanut gallery opinion: Definitely agree. If this is the only possible implementation, then an explicit method seems more correct. Another option would be a blanket impl<T: MemoryRegion> PartialOrd<usize> for T {}, though I'm not sure how well that'd interact with the code at large.
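A minimal sketch of the standalone-helper idea floated above. The exact Ordering convention is an assumption here; it's written so that a binary search returns Equal for a region containing the address:

    use core::cmp::Ordering;

    pub fn compare_mem_region<T: MemoryRegion>(region: &T, addr: usize) -> Ordering {
        if region.end_addr() <= addr {
            // Region lies entirely below the address.
            Ordering::Less
        } else if region.base_addr() > addr {
            // Region lies entirely above the address.
            Ordering::Greater
        } else {
            // Region contains the address.
            Ordering::Equal
        }
    }

(For what it's worth, the blanket impl<T: MemoryRegion> PartialOrd<usize> for T variant would likely run into the orphan rule, since PartialOrd is a foreign trait and T here is a bare type parameter.)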
@@ -159,35 +176,51 @@ where

    // Per the function's preconditions, the region table is sorted in ascending
    // order of base address, and the regions within it do not overlap. This
-    // lets us use a one-pass algorithm.
+    // lets us use a binary search followed by a short scan.
this is lovely!
and, it's really cool that we have tests for this code, now that we've gone and made it substantially more complex... :)
Because the code already requires the region to be ordered by address and non-overlapping, we can use a binary search here. This makes timing a lot more consistent -- it's now O(log n) in the region table size (which is 8 in practice) and then a potential linear scan to the end of the slice.
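To make the shape concrete, here's a rough, self-contained sketch of that strategy, using the MemoryRegion trait from the hunk above. The function name and the exact coverage rules are illustrative, not the kernel's real code, and it assumes compare returns Less / Equal / Greater for a region below / containing / above the address:

    fn range_covered<T: MemoryRegion>(table: &[T], start: usize, len: usize) -> bool {
        let end = match start.checked_add(len) {
            Some(e) => e,
            None => return false,
        };
        // O(log n) in the table size: find a region containing `start`.
        let Ok(mut i) = table.binary_search_by(|r| r.compare(start)) else {
            return false;
        };
        // Short scan: walk forward while contiguous regions cover the rest.
        let mut cursor = start;
        while cursor < end {
            match table.get(i) {
                Some(r) if r.base_addr() <= cursor => {
                    cursor = r.end_addr();
                    i += 1;
                }
                _ => return false, // gap between regions, or ran off the table
            }
        }
        true
    }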
- Rewrite the priority_scan loop to generate simpler code with fewer bounds checks and panics.
- Directly return a &Task from priority_scan and task::select, so that every user of task::select doesn't have to immediately turn around and bounds-check its returned usize against the task table.
- Switch scheduling-related operations to use &Task rather than &mut Task so that the return of task::select can alias other live pointers into the task table. (The use of &mut Task before was me being conservative.)
This reduces the cost of the `apply_memory_protection` phase of context switches by computing the descriptors in a const fn and sneaking them into the RegionDesc struct via a new field. This causes MPU load on ARMv7-M (Cortex-M4) to finish in 45% the time it previously took, which in turn knocks about 266 cycles off IPC-related syscalls.
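Roughly, the idea looks something like the following sketch. The struct and function names are placeholders loosely based on the later diff, and the RASR layout is simplified; only the `30 - leading_zeros` size encoding is taken from the actual change:

    /// Placeholder for precomputed ARMv7-M MPU words.
    #[derive(Copy, Clone)]
    pub struct RegionDescExt {
        pub rbar: u32, // MPU_RBAR contents (region base address)
        pub rasr: u32, // MPU_RASR contents (attributes and size)
    }

    pub const fn precompute_region(base: u32, size: u32, attr_bits: u32) -> RegionDescExt {
        // PMSAv7 regions are power-of-two sized; the SIZE field is log2(size) - 1.
        let l2size = 30 - size.leading_zeros();
        RegionDescExt {
            // Base address only; region number / VALID can be OR'd in at load time.
            rbar: base,
            // attributes | SIZE | ENABLE
            rasr: attr_bits | (l2size << 1) | 1,
        }
    }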
This alters the v6/v7 routine to switch off the MPU before updating it, and switch it back on after, like v8 does. This has allowed me to use cheaper code to do the update. That, combined with some field ordering changes to optimize for access order in the generated code, cuts the cycles spent in this routine by roughly half.
This was generating code for the ..copy_len slicing _and_ the equality check implied by copy_from_slice. Now it just does the latter. This only takes about 3% off safe_copy, but produces smaller binaries by removing an additional panic site.
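A hand-wavy before/after illustration of that point (this is not the kernel's safe_copy; buffer handling is simplified):

    // Before: both `..copy_len` slicings can panic on their own, and then
    // copy_from_slice checks the lengths again.
    fn copy_prefix_before(dst: &mut [u8], src: &[u8], copy_len: usize) {
        dst[..copy_len].copy_from_slice(&src[..copy_len]);
    }

    // After: hand in slices already narrowed to the copy length, so only the
    // single length-equality check inside copy_from_slice remains.
    fn copy_prefix_after(dst: &mut [u8], src: &[u8]) {
        dst.copy_from_slice(src);
    }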
Empty copies are somewhat unusual, but appear in IPC interfaces that follow certain patterns, or in lease accesses that hit the end. Previously they were expensive nops. This makes them cheap nops. This reduces the cost of a zero-byte send by 24% (to about 928 cycles on a Cortex-M4, if you're curious).
I had gone to some lengths in the original implementation to only use safe code, and in general it worked pretty well. However, it was generating more panic sites (and checks) than I'd like. This new version is shorter, clearer (IMO), and generates somewhat faster code. By shortening the various routines involved, this reduces the mean cost of safe_copy during IPC syscalls by about 3%.
Force-pushed from 53fb987 to 04cd23a.
quick review of 2a56849
+                task::NextTask::Specific(i) => &tasks[i],
                 task::NextTask::Other => task::select(idx, tasks),
-                task::NextTask::Same => idx,
+                task::NextTask::Same => &tasks[idx],
It occurs to me that, because the length of the task table is always HUBRIS_TASK_COUNT and task indices are always in bounds, we could probably shave off a few instructions (and an unreachable panic site) by using get_unchecked when indexing the task table here and elsewhere in the scheduler. I'm not sure whether this optimization is worth the unsafe code, but it's worth thinking about...
Huh, I hadn't realized that the number of tasks is actually known at kernel build time as well. Is there a reason why that knowledge isn't more widely used in the kernel? Specifically, why don't all these methods work with &(mut) [Task; HUBRIS_TASK_COUNT] references? I just tested this locally, and the changes needed to do that are quite minimal. Would that cause e.g. problems with debugging, compatibility, or code size?
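A tiny sketch of how the two ideas in this thread fit together. Task and the HUBRIS_TASK_COUNT value below are stand-ins to keep the example self-contained:

    const HUBRIS_TASK_COUNT: usize = 4; // stand-in value
    struct Task; // stand-in for the kernel's Task

    // Taking a fixed-size array reference makes the length visible to the
    // compiler; get_unchecked then drops the bounds check at the cost of an
    // unsafe block whose justification lives in the scheduler's invariants.
    fn task_by_index(tasks: &[Task; HUBRIS_TASK_COUNT], i: usize) -> &Task {
        debug_assert!(i < HUBRIS_TASK_COUNT);
        unsafe { tasks.get_unchecked(i) }
    }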
            };

-            if next == idx {
+            if next as *const _ == &tasks[idx] as *const _ {
Isn't it better to do this comparison by index rather than by address? Writing &tasks[idx] will do a bounds check when accessing the task table, which could be avoided if we just compared the integers. OTOH, then we are using an additional register or stack slot to store both the index and the address... 🤷♀️
            let next = match task::force_fault(tasks, idx, fault) {
-                task::NextTask::Specific(i) => i,
+                task::NextTask::Specific(i) => &tasks[i],
                 task::NextTask::Other => task::select(idx, tasks),
-                task::NextTask::Same => idx,
+                task::NextTask::Same => &tasks[idx],
             };

-            if next == idx {
+            if next as *const _ == &tasks[idx] as *const _ {
same comments as above
        if pos >= tasks.len() {
            pos = 0;
        }
        let t = &tasks[pos];
similarly, there's probably a few instructions and a panic site that could be shaved off here by using get_unchecked, given that the if immediately above this line ensures that pos is always in bounds. On the other hand, LLVM might be smart enough to figure that out here --- the code that ensures it's in bounds is straightforward enough that I wouldn't be shocked if the compiler could figure it out...
I checked on Godbolt and the panics are indeed optimized out by the compiler (with the correct target in use, thumbv6m-none-eabi, and opt-level=2/s/z).
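For reference, a reduced, self-contained shape of that pattern (u32 stands in for Task and 8 for the table length; this is not the kernel code): the clamp plus a known non-zero, fixed length lets the compiler prove the index is in bounds.

    pub fn wrap_and_index(tasks: &[u32; 8], mut pos: usize) -> u32 {
        if pos >= tasks.len() {
            pos = 0;
        }
        // With the clamp above, release builds should need no bounds-check
        // panic for this indexing.
        tasks[pos]
    }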
🥜 checking in; really cool and interesting stuff! I left a few comments, hopefully of some potential use to you <3
            };

-            if next == idx {
+            if next as *const _ == &tasks[idx] as *const _ {
thought: Looking at the code, force_fault never returns NextTask::Same and can only return Specific(0). select could theoretically return the same index, but not after force_fault, since the force-faulted task isn't healthy. So I guess this panic could only be reached if it's the supervisor that is being faulted and it is blocked in receive. Is that possible? I don't know, but I assume it's at least a thing that shouldn't happen, since this panic is here :) Anyway, I wonder if it would be worth it to make up a special type for force_fault to return instead of NextTask, one that can just be into()-converted into NextTask?

I guess it might not make much difference though: the Specific(0) path would still want to check for idx != 0 to make sure we've not just faulted the supervisor and then asked it to wake, and the Other path would still want to check that next != idx just to be sure that task::select() doesn't regress and start returning faulted tasks. Theoretically, though, it seems like there might be a good bit of unnecessary guarding going on here.
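A rough sketch of what such a narrowed return type could look like. FaultedNext is a made-up name, and NextTask is shown here with only the variants that appear in this diff:

    pub enum NextTask {
        Same,
        Specific(usize),
        Other,
    }

    /// Hypothetical narrowed result for force_fault: no `Same` variant.
    pub enum FaultedNext {
        Specific(usize),
        Other,
    }

    impl From<FaultedNext> for NextTask {
        fn from(f: FaultedNext) -> Self {
            match f {
                FaultedNext::Specific(i) => NextTask::Specific(i),
                FaultedNext::Other => NextTask::Other,
            }
        }
    }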
    let mut choice = None;
    for i in search_order {
        if !pred(&tasks[i]) {
) -> Option<(usize, &Task)> {
thought: It might be useful to make force_fault also return a (usize, &Task) pair: it would extend the lifetime of the &mut [Task] borrow, but that seems like it may not cause borrow-checker problems, based on a quick look at some usage sites. When it does pose an issue, the &Task can of course be dropped to release the &mut [Task] borrow for reuse.
        // On ARMv6-M, there is no CLZ instruction either. This winds up
        // generating decent intrinsic code for `leading_zeros` so we'll live
        // with it.
        let l2size = 30 - size.leading_zeros();
thought: Since the code is now build-time, I assume you could use trailing_zeros.
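A quick sanity check of that equivalence, assuming size has already been validated as a power of two (which the v6/v7 MPU encoding requires):

    fn main() {
        for k in 5..=31u32 {
            let size: u32 = 1 << k;
            // l2size as computed above equals trailing_zeros(size) - 1.
            assert_eq!(30 - size.leading_zeros(), size.trailing_zeros() - 1);
        }
    }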
@@ -350,11 +350,17 @@ pub fn reinitialize(task: &mut task::Task) {
    task.save_mut().exc_return = EXC_RETURN_CONST;
}

/// PMSAv6/7-style precomputed region data.
///
/// This struct is `repr(C)` to preserve the order of its fields, which happens
suggestion: If I'm not badly mistaken here, it seems like only the RegionDescExt data of the region table is relevant for applying memory protection. And that data of course goes unused by all other users of the region table, since it's only being added now.

It might thus be a good opportunity to split the region table into two parts, something like regions: [&'static RegionDesc; REGIONS_PER_TASK], regions_ext: [RegionDescExt; REGIONS_PER_TASK]. This would help with cache efficiency (assuming that the RegionDescExt part can be brought into the TaskDesc struct itself, or perhaps put behind a &'static [RegionDescExt; REGIONS_PER_TASK] to avoid increasing the TaskDesc size too much).
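Loosely, the split might look like the following. The REGIONS_PER_TASK value and all field contents are assumptions for illustration:

    pub const REGIONS_PER_TASK: usize = 8; // assumed value

    /// Data consulted by access checks (safe_copy and friends).
    pub struct RegionDesc {
        pub base: u32,
        pub size: u32,
        pub attributes: u32,
    }

    /// Precomputed data consulted only when loading the MPU.
    pub struct RegionDescExt {
        pub rbar: u32,
        pub rasr: u32,
    }

    pub struct TaskDesc {
        pub regions: [&'static RegionDesc; REGIONS_PER_TASK],
        /// Behind a reference to keep TaskDesc itself small.
        pub regions_ext: &'static [RegionDescExt; REGIONS_PER_TASK],
    }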
}

#[cfg(any(armv6m, armv7m))]
pub const fn compute_region_extension_data(
praise: Computing the MPU contents at build-time seems like a pretty wild improvement to a very core piece of the kernel. Really cool stuff!
There's a lot of stuff in here, and I may break it up before sending it for review. Feedback welcome regardless!
These changes together reduce the time spent in syscalls in my test application by about 26%. Some calls are boosted significantly more than that (30% off large sends, 40% off empty sends).
I recommend reviewing the commits rather than the combined diff.