Syscall / context switch performance improvements #1949

Draft · cbiffle wants to merge 7 commits into master from cbiffle/syscall-perf
Conversation

@cbiffle (Collaborator) commented Dec 13, 2024

There's a lot of stuff in here, and I may break it up before sending it for review. Feedback welcome regardless!

These changes together reduce the time spent in syscalls in my test application by about 26%. Some calls are boosted significantly more than that (30% off large sends, 40% off empty sends).

I recommend reviewing the commits rather than the combined diff.

@@ -82,6 +84,17 @@ pub trait MemoryRegion {
fn contains(&self, addr: usize) -> bool;
fn base_addr(&self) -> usize;
fn end_addr(&self) -> usize;

#[inline(always)]
fn compare(&self, addr: usize) -> Ordering {
Member:

part of me wonders whether instead of having it as a method with a default implementation, we should pull this out into a separate

pub fn compare_mem_region<T: MemoryRegion>(region: &T, addr: usize) -> Ordering

because the default implementation can be overridden by an implementation of MemoryRegion, and thus end up with incorrect behavior. but, in practice...we can just not do that, and having it be a method is nicer, i suppose... 🤷‍♀️

Contributor:

Peanut gallery opinion: Definitely agree. If this is the only possible implementation, then an explicit method seems more correct.

Another option would be a blanket impl<T: MemoryRegion> PartialOrd<usize> for T {}. Though I'm not sure how well that'd interact with the code at large.
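For concreteness, here is a minimal sketch of the free-function shape being discussed. The trait is trimmed down to the two accessors the comparison needs, and the ordering convention is an assumption, not the kernel's actual code:

```rust
use core::cmp::Ordering;

// Reduced trait for illustration; not the kernel's actual definition.
pub trait MemoryRegion {
    fn base_addr(&self) -> usize;
    fn end_addr(&self) -> usize;
}

// As a free function, this comparison can't be overridden by a MemoryRegion
// implementor, unlike a defaulted trait method.
pub fn compare_mem_region<T: MemoryRegion>(region: &T, addr: usize) -> Ordering {
    if region.end_addr() <= addr {
        Ordering::Less // region lies entirely below the address
    } else if region.base_addr() > addr {
        Ordering::Greater // region lies entirely above the address
    } else {
        Ordering::Equal // the address falls within the region
    }
}
```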

@@ -159,35 +176,51 @@ where

// Per the function's preconditions, the region table is sorted in ascending
// order of base address, and the regions within it do not overlap. This
// lets us use a one-pass algorithm.
// lets us use a binary search followed by a short scan
Member:

this is lovely!

and, it's really cool that we have tests for this code, now that we've gone and made it substantially more complex... :)

Because the code already requires the region to be ordered by address
and non-overlapping, we can use a binary search here. This makes timing
a lot more consistent -- it's now O(log n) in the region table size
(which is 8 in practice) and then a potential linear scan to the end of
the slice.
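A hedged sketch of that shape, using a simplified stand-in region type rather than the kernel's actual tables (the contiguity requirement for accesses that span regions is an assumption here):

```rust
use core::cmp::Ordering;

struct Region { base: usize, size: usize }

impl Region {
    fn compare(&self, addr: usize) -> Ordering {
        if addr < self.base {
            Ordering::Greater
        } else if addr - self.base >= self.size {
            Ordering::Less
        } else {
            Ordering::Equal
        }
    }
    fn end(&self) -> usize { self.base + self.size }
}

/// Checks that `[addr, addr + len)` is covered, assuming `table` is sorted by
/// base address with no overlaps: binary search finds the region holding
/// `addr`, then a short forward scan walks any adjacent regions.
fn covers(table: &[Region], addr: usize, len: usize) -> bool {
    let Ok(first) = table.binary_search_by(|r| r.compare(addr)) else {
        return false;
    };
    let goal = addr + len;
    let mut reached = table[first].end();
    for r in &table[first + 1..] {
        if reached >= goal { break; }
        if r.base != reached { return false; } // gap between regions
        reached = r.end();
    }
    reached >= goal
}
```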
- Rewrite the priority_scan loop to generate simpler code with fewer
  bounds checks and panics.

- Directly return a &Task from priority_scan and task::select, so that
  every user of task::select doesn't have to immediately turn around and
  bounds-check its returned usize against the task table.

- Switch scheduling related operations to use &Task rather than &mut
  Task so that the return of task::select can alias other live pointers
  into the task table. (The use of &mut Task before was me being
  conservative.)
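To illustrate the &Task-returning shape (this is a toy, not the kernel's actual scheduler or its selection rule):

```rust
// Task and the selection rule below are simplified stand-ins.
struct Task { priority: u8, healthy: bool }

/// Returns a reference to the chosen task, so callers never have to turn an
/// index back into a &Task (and pay another bounds check) themselves.
fn select(tasks: &[Task]) -> &Task {
    // The kernel's task table is never empty (the supervisor is always there).
    let mut best = &tasks[0];
    for t in &tasks[1..] {
        // Lower numeric priority wins; unhealthy tasks are skipped.
        if t.healthy && (!best.healthy || t.priority < best.priority) {
            best = t;
        }
    }
    best
}
```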
This reduces the cost of the `apply_memory_protection` phase of context
switches by computing the descriptors in a const fn and sneaking them
into the RegionDesc struct via a new field.

This causes MPU load on ARMv7-M (Cortex-M4) to finish in 45% of the time it
previously took, which in turn knocks about 266 cycles off IPC-related
syscalls.
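As a rough sketch of the idea: a const fn packs the per-region register values once at build time, so applying memory protection only has to store them. The packing below is illustrative, not the kernel's actual RBAR/RASR encoding; only the l2size expression mirrors the one shown later in this review:

```rust
#[derive(Copy, Clone)]
pub struct RegionDescExt {
    pub rbar: u32, // precomputed base-address register value
    pub rasr: u32, // precomputed attribute/size register value
}

pub const fn compute_region_extension_data(base: u32, size: u32, attrs: u32) -> RegionDescExt {
    // SIZE encodes log2(size) - 1 for power-of-two regions; because this runs
    // in a const context, the cost is paid at build time, not per context
    // switch.
    let l2size = 30 - size.leading_zeros();
    RegionDescExt {
        rbar: base,                              // VALID/region-number bits omitted
        rasr: (attrs << 16) | (l2size << 1) | 1, // bit 0 = region enable
    }
}
```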
This alters the v6/v7 routine to switch off the MPU before updating it,
and switch it back on after, like v8 does. This has allowed me to use
cheaper code to do the update. That, combined with some field ordering
changes to optimize for access order in the generated code, cuts the
cycles spent in this routine by roughly half.
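A hedged sketch of the disable/reload/re-enable pattern described here, using the architectural ARMv7-M MPU register addresses; the kernel's real routine differs in detail:

```rust
const MPU_CTRL: *mut u32 = 0xE000_ED94 as *mut u32;
const MPU_RBAR: *mut u32 = 0xE000_ED9C as *mut u32;
const MPU_RASR: *mut u32 = 0xE000_EDA0 as *mut u32;

/// `regions` holds precomputed (RBAR, RASR) word pairs, one per MPU region.
unsafe fn load_mpu(regions: &[(u32, u32)]) {
    core::ptr::write_volatile(MPU_CTRL, 0); // MPU off while we rewrite it
    for &(rbar, rasr) in regions {
        // Region number / VALID bits are assumed baked into the RBAR word,
        // so each region is just two stores.
        core::ptr::write_volatile(MPU_RBAR, rbar);
        core::ptr::write_volatile(MPU_RASR, rasr);
    }
    // ENABLE (bit 0) + PRIVDEFENA (bit 2) so privileged code keeps the
    // default memory map as a background region.
    core::ptr::write_volatile(MPU_CTRL, 0b101);
    core::arch::asm!("dsb", "isb"); // make the new settings take effect
}
```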
This was generating code for the ..copy_len slicing _and_ the equality
check implied by copy_from_slice. Now it just does the latter.

This only takes about 3% off safe_copy, but produces smaller binaries by
removing an additional panic site.
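Not the actual safe_copy change, but an illustration of the general pattern the message describes: trim both slices to a common length up front so the only remaining length check is the one inside copy_from_slice itself:

```rust
fn copy_prefix(dst: &mut [u8], src: &[u8]) -> usize {
    let copy_len = dst.len().min(src.len());
    // copy_len can't exceed either slice's length, so these splits are in
    // bounds by construction.
    let (dst_prefix, _) = dst.split_at_mut(copy_len);
    let (src_prefix, _) = src.split_at(copy_len);
    dst_prefix.copy_from_slice(src_prefix);
    copy_len
}
```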
Empty copies are somewhat unusual, but appear in IPC interfaces that
follow certain patterns, or in lease accesses that hit the end.
Previously they were expensive nops. This makes them cheap nops.

This reduces the cost of a zero-byte send by 24% (to about 928 cycles on
a Cortex-M4, if you're curious).
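The shape of the fast path, as a hypothetical helper rather than the kernel's safe_copy; the point is the early return, which turns a zero-length transfer into a cheap nop instead of running the full setup:

```rust
fn copy_between(dst: &mut [u8], src: &[u8]) -> usize {
    let copy_len = dst.len().min(src.len());
    if copy_len == 0 {
        // Zero-byte sends and lease accesses that hit the end land here;
        // previously this still paid the full setup cost.
        return 0;
    }
    dst[..copy_len].copy_from_slice(&src[..copy_len]);
    copy_len
}
```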
I had gone to some lengths in the original implementation to only use
safe code, and in general it worked pretty well. However, it was
generating more panic sites (and checks) than I'd like.

This new version is shorter, clearer (IMO), and generates somewhat
faster code.

By shortening the various routines involved, this reduces the mean cost
of safe_copy during IPC syscalls by about 3%.
@cbiffle force-pushed the cbiffle/syscall-perf branch from 53fb987 to 04cd23a on December 14, 2024 00:29
@hawkw (Member) left a comment:

quick review of 2a56849

Comment on lines +1476 to +1478
task::NextTask::Specific(i) => &tasks[i],
task::NextTask::Other => task::select(idx, tasks),
task::NextTask::Same => idx,
task::NextTask::Same => &tasks[idx],
Member:

It occurs to me that, because the length of the task table is always HUBRIS_TASK_COUNT and task indices are always in bounds, we could probably shave off a few instructions (and an unreachable panic site) by using get_unchecked when indexing the task table here and elsewhere in the scheduler. I'm not sure whether this optimization is worth the unsafe code, but it's worth thinking about...
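A hypothetical illustration of the get_unchecked variant (written generically, since the kernel's Task type isn't shown here):

```rust
fn get_by_index<T>(table: &[T], i: usize) -> &T {
    // SAFETY: relies on the caller's invariant that `i` came from the task
    // table itself, whose length is the build-time HUBRIS_TASK_COUNT.
    unsafe { table.get_unchecked(i) }
}
```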

@aapoalas (Contributor) commented Dec 15, 2024:

Huh, I hadn't realized that the number of tasks is actually known at the kernel build time as well.

Is there a reason why that knowledge isn't more widely used in the kernel, actually? Specifically, why don't all these methods work with &(mut) [Task; HUBRIS_TASK_COUNT] references? I just tested this locally, and the changes needed to do that are quite minimal indeed. Would that cause, e.g., problems with debugging, compatibility, or code size?

};

if next == idx {
if next as *const _ == &tasks[idx] as *const _ {
Member:

isn't it better to do this comparison by index rather than by address, because writing &tasks[idx] will do a bounds check when accessing the task table, which could be avoided if we just compared the integers? OTOH then we are using an additional register or stack slot to store both the index and the address... 🤷‍♀️
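The two comparison styles being weighed, as standalone illustrative functions:

```rust
fn same_by_index(next_idx: usize, idx: usize) -> bool {
    next_idx == idx // pure integer compare; no table access at all
}

fn same_by_address<T>(next: &T, tasks: &[T], idx: usize) -> bool {
    // Re-indexes the table (bounds check included) and compares addresses;
    // core::ptr::eq compares pointers without reading the values.
    core::ptr::eq(next, &tasks[idx])
}
```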

Comment on lines 1688 to +1694
let next = match task::force_fault(tasks, idx, fault) {
task::NextTask::Specific(i) => i,
task::NextTask::Specific(i) => &tasks[i],
task::NextTask::Other => task::select(idx, tasks),
task::NextTask::Same => idx,
task::NextTask::Same => &tasks[idx],
};

if next == idx {
if next as *const _ == &tasks[idx] as *const _ {
Member:

same comments as above

if pos >= tasks.len() {
pos = 0;
}
let t = &tasks[pos];
Member:

similarly, there's probably a few instructions and a panic site that could be shaved off here by using get_unchecked, given that the if immediately above this line ensures that pos is always in bounds. On the other hand, LLVM might be smart enough to figure that out here --- the code that ensures it's in bounds is straightforward enough that I wouldn't be shocked if the compiler could figure it out...

Contributor:

I checked on Godbolt and the panics are indeed optimized out by the compiler (with the correct target in use, thumbv6m-none-eabi, and opt-level=2/s/z).

@aapoalas (Contributor) left a comment:

🥜 checking in; really cool and interesting stuff! I left a few comments, hopefully of some potential use to you <3


};

if next == idx {
if next as *const _ == &tasks[idx] as *const _ {
@aapoalas (Contributor) commented Dec 15, 2024:

thought: Looking at the code, force_fault never returns NextTask::Same and can only return Specific(0). select could theoretically return the same index but not after force_fault since the force-faulted task isn't healthy.

So I guess this panic could only be reached if it's the supervisor that is being faulted and it is blocked in receive. Is that possible? I don't know, but I assume it's at least a thing that shouldn't happen, since this panic is here :) Anyway, I wonder if it would be worth making up a special type for force_fault to return instead of NextTask, one that can just be converted into NextTask with into().

I guess it might not make much difference though: The Specific(0) path would still want to check for idx != 0 to make sure we've not just faulted the supervisor and then asked it to wake, and the Other path would still want to check that next != idx just to be sure that task::select() doesn't regress and start returning faulted tasks. Theoretically, though, it seems like there might be a good bit of unnecessary guarding going on here.


let mut choice = None;
for i in search_order {
if !pred(&tasks[i]) {
) -> Option<(usize, &Task)> {
@aapoalas (Contributor) commented Dec 15, 2024:

thought: It might be useful to make force_fault to also return a (usize, &Task) pair: It would extend the lifetime of the &mut [Task] borrow but that seems like it may not cause borrow-checker problems, based on a quick look at some usage sites. When it does pose an issue, the &Task can of course be dropped to release the &mut [Task] borrow for reuse.

// On ARMv6-M, there is no CLZ instruction either. This winds up
// generating decent intrinsic code for `leading_zeros` so we'll live
// with it.
let l2size = 30 - size.leading_zeros();
Contributor:

thought: Since the code is now build-time, I assume you could use trailing_zeros.
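For a power-of-two `size`, both forms yield log2(size) - 1, and since the computation now happens in a const context, the ARMv6-M lack of CLZ no longer matters either way:

```rust
const fn l2size_clz(size: u32) -> u32 {
    30 - size.leading_zeros()
}

const fn l2size_ctz(size: u32) -> u32 {
    size.trailing_zeros() - 1
}

// Example: a 1 KiB (1024-byte) region gives 9 from both.
```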

@@ -350,11 +350,17 @@ pub fn reinitialize(task: &mut task::Task) {
task.save_mut().exc_return = EXC_RETURN_CONST;
}

/// PMSAv6/7-style precomputed region data.
///
/// This struct is `repr(C)` to preserve the order of its fields, which happens
Contributor:

suggestion: If I'm not badly mistaken here, it seems like only the RegionDescExt data of the region table is relevant for applying memory protection. And that data of course goes unused by all other users of the region table, since it's only being added now.

It might thus be a good opportunity to split the region table into two parts, something like regions: [&'static RegionDesc; REGIONS_PER_TASK] and regions_ext: [RegionDescExt; REGIONS_PER_TASK]. This would help with cache efficiency (assuming that the RegionDescExt part can be brought into the TaskDesc struct itself, or perhaps put behind a &'static [RegionDescExt; REGIONS_PER_TASK] to avoid increasing the TaskDesc size too much).
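A rough sketch of the split being suggested (struct names follow the comment; the field contents are placeholders, and REGIONS_PER_TASK = 8 matches the table size mentioned earlier in this PR):

```rust
const REGIONS_PER_TASK: usize = 8;

struct RegionDesc { base: u32, size: u32, attributes: u32 }
struct RegionDescExt { rbar: u32, rasr: u32 } // only the MPU-load path reads this

struct TaskDesc {
    regions: [&'static RegionDesc; REGIONS_PER_TASK],
    // Separate dense array so applying memory protection doesn't drag each
    // full RegionDesc through the cache.
    regions_ext: &'static [RegionDescExt; REGIONS_PER_TASK],
}
```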

}

#[cfg(any(armv6m, armv7m))]
pub const fn compute_region_extension_data(
Contributor:

praise: Computing the MPU contents at build-time seems like a pretty wild improvement to a very core piece of the kernel. Really cool stuff!
