Syscall / context switch performance improvements #1949

Draft · cbiffle wants to merge 7 commits into master from cbiffle/syscall-perf
Conversation

@cbiffle (Collaborator) commented Dec 13, 2024

There's a lot of stuff in here, and I may break it up before sending it for review. Feedback welcome regardless!

These changes together reduce the time spent in syscalls in my test application by about 26%. Some calls are boosted significantly more than that (30% off large sends, 40% off empty sends).

I recommend reviewing the commits rather than the combined diff.

@@ -82,6 +84,17 @@ pub trait MemoryRegion {
fn contains(&self, addr: usize) -> bool;
fn base_addr(&self) -> usize;
fn end_addr(&self) -> usize;

#[inline(always)]
fn compare(&self, addr: usize) -> Ordering {
Member:

part of me wonders whether instead of having it as a method with a default implementation, we should pull this out into a separate

pub fn compare_mem_region<T: MemoryRegion>(region: &T, addr: usize) -> Ordering

because the default implementation can be overridden by an implementation of MemoryRegion, and thus end up with incorrect behavior. but, in practice...we can just not do that, and having it be a method is nicer, i suppose... 🤷‍♀️

Contributor:

Peanut gallery opinion: Definitely agree. If this is the only possible implementation, then an explicit method seems more correct.

Another option would be a blanket impl<T: MemoryRegion> PartialOrd<usize> for T {}. Though I'm not sure how well that'd interact with the code at large.
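For concreteness, here is a minimal sketch of the free-function shape being discussed. The trait is trimmed down to the two accessors the comparison needs, and the ordering convention is an assumption, not the kernel's actual code:

```rust
use core::cmp::Ordering;

// Reduced trait for illustration; not the kernel's actual definition.
pub trait MemoryRegion {
    fn base_addr(&self) -> usize;
    fn end_addr(&self) -> usize;
}

// As a free function, this comparison can't be overridden by a MemoryRegion
// implementor, unlike a defaulted trait method.
pub fn compare_mem_region<T: MemoryRegion>(region: &T, addr: usize) -> Ordering {
    if region.end_addr() <= addr {
        Ordering::Less // region lies entirely below the address
    } else if region.base_addr() > addr {
        Ordering::Greater // region lies entirely above the address
    } else {
        Ordering::Equal // the address falls within the region
    }
}
```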

@@ -159,35 +176,51 @@ where

// Per the function's preconditions, the region table is sorted in ascending
// order of base address, and the regions within it do not overlap. This
// lets us use a one-pass algorithm.
// lets us use a binary search followed by a short scan
Member:

this is lovely!

and, it's really cool that we have tests for this code, now that we've gone and made it substantially more complex... :)

Because the code already requires the region to be ordered by address
and non-overlapping, we can use a binary search here. This makes timing
a lot more consistent -- it's now O(log n) in the region table size
(which is 8 in practice) and then a potential linear scan to the end of
the slice.
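A hedged sketch of that shape, using a simplified stand-in region type rather than the kernel's actual tables (the contiguity requirement for accesses that span regions is an assumption here):

```rust
use core::cmp::Ordering;

struct Region { base: usize, size: usize }

impl Region {
    fn compare(&self, addr: usize) -> Ordering {
        if addr < self.base {
            Ordering::Greater
        } else if addr - self.base >= self.size {
            Ordering::Less
        } else {
            Ordering::Equal
        }
    }
    fn end(&self) -> usize { self.base + self.size }
}

/// Checks that `[addr, addr + len)` is covered, assuming `table` is sorted by
/// base address with no overlaps: binary search finds the region holding
/// `addr`, then a short forward scan walks any adjacent regions.
fn covers(table: &[Region], addr: usize, len: usize) -> bool {
    let Ok(first) = table.binary_search_by(|r| r.compare(addr)) else {
        return false;
    };
    let goal = addr + len;
    let mut reached = table[first].end();
    for r in &table[first + 1..] {
        if reached >= goal { break; }
        if r.base != reached { return false; } // gap between regions
        reached = r.end();
    }
    reached >= goal
}
```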
- Rewrite the priority_scan loop to generate simpler code with fewer
  bounds checks and panics.

- Directly return a &Task from priority_scan and task::select, so that
  every user of task::select doesn't have to immediately turn around and
  bounds-check its returned usize against the task table.

- Switch scheduling related operations to use &Task rather than &mut
  Task so that the return of task::select can alias other live pointers
  into the task table. (The use of &mut Task before was me being
  conservative.)
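To illustrate the &Task-returning shape (this is a toy, not the kernel's actual scheduler or its selection rule):

```rust
// Task and the selection rule below are simplified stand-ins.
struct Task { priority: u8, healthy: bool }

/// Returns a reference to the chosen task, so callers never have to turn an
/// index back into a &Task (and pay another bounds check) themselves.
fn select(tasks: &[Task]) -> &Task {
    // The kernel's task table is never empty (the supervisor is always there).
    let mut best = &tasks[0];
    for t in &tasks[1..] {
        // Lower numeric priority wins; unhealthy tasks are skipped.
        if t.healthy && (!best.healthy || t.priority < best.priority) {
            best = t;
        }
    }
    best
}
```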
This reduces the cost of the `apply_memory_protection` phase of context
switches by computing the descriptors in a const fn and sneaking them
into the RegionDesc struct via a new field.

This causes MPU load on ARMv7-M (Cortex-M4) to finish in 45% of the time it
previously took, which in turn knocks about 266 cycles off IPC-related
syscalls.
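As a rough sketch of the idea: a const fn packs the per-region register values once at build time, so applying memory protection only has to store them. The packing below is illustrative, not the kernel's actual RBAR/RASR encoding; only the l2size expression mirrors the one shown later in this review:

```rust
#[derive(Copy, Clone)]
pub struct RegionDescExt {
    pub rbar: u32, // precomputed base-address register value
    pub rasr: u32, // precomputed attribute/size register value
}

pub const fn compute_region_extension_data(base: u32, size: u32, attrs: u32) -> RegionDescExt {
    // SIZE encodes log2(size) - 1 for power-of-two regions; because this runs
    // in a const context, the cost is paid at build time, not per context
    // switch.
    let l2size = 30 - size.leading_zeros();
    RegionDescExt {
        rbar: base,                              // VALID/region-number bits omitted
        rasr: (attrs << 16) | (l2size << 1) | 1, // bit 0 = region enable
    }
}
```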
This alters the v6/v7 routine to switch off the MPU before updating it,
and switch it back on after, like v8 does. This has allowed me to use
cheaper code to do the update. That, combined with some field ordering
changes to optimize for access order in the generated code, cuts the
cycles spent in this routine by roughly half.
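A hedged sketch of the disable/reload/re-enable pattern described here, using the architectural ARMv7-M MPU register addresses; the kernel's real routine differs in detail:

```rust
const MPU_CTRL: *mut u32 = 0xE000_ED94 as *mut u32;
const MPU_RBAR: *mut u32 = 0xE000_ED9C as *mut u32;
const MPU_RASR: *mut u32 = 0xE000_EDA0 as *mut u32;

/// `regions` holds precomputed (RBAR, RASR) word pairs, one per MPU region.
unsafe fn load_mpu(regions: &[(u32, u32)]) {
    core::ptr::write_volatile(MPU_CTRL, 0); // MPU off while we rewrite it
    for &(rbar, rasr) in regions {
        // Region number / VALID bits are assumed baked into the RBAR word,
        // so each region is just two stores.
        core::ptr::write_volatile(MPU_RBAR, rbar);
        core::ptr::write_volatile(MPU_RASR, rasr);
    }
    // ENABLE (bit 0) + PRIVDEFENA (bit 2) so privileged code keeps the
    // default memory map as a background region.
    core::ptr::write_volatile(MPU_CTRL, 0b101);
    core::arch::asm!("dsb", "isb"); // make the new settings take effect
}
```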
This was generating code for the ..copy_len slicing _and_ the equality
check implied by copy_from_slice. Now it just does the latter.

This only takes about 3% off safe_copy, but produces smaller binaries by
removing an additional panic site.
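Not the actual safe_copy change, but an illustration of the general pattern the message describes: trim both slices to a common length up front so the only remaining length check is the one inside copy_from_slice itself:

```rust
fn copy_prefix(dst: &mut [u8], src: &[u8]) -> usize {
    let copy_len = dst.len().min(src.len());
    // copy_len can't exceed either slice's length, so these splits are in
    // bounds by construction.
    let (dst_prefix, _) = dst.split_at_mut(copy_len);
    let (src_prefix, _) = src.split_at(copy_len);
    dst_prefix.copy_from_slice(src_prefix);
    copy_len
}
```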
Empty copies are somewhat unusual, but appear in IPC interfaces that
follow certain patterns, or in lease accesses that hit the end.
Previously they were expensive nops. This makes them cheap nops.

This reduces the cost of a zero-byte send by 24% (to about 928 cycles on
a Cortex-M4, if you're curious).
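The shape of the fast path, as a hypothetical helper rather than the kernel's safe_copy; the point is the early return, which turns a zero-length transfer into a cheap nop instead of running the full setup:

```rust
fn copy_between(dst: &mut [u8], src: &[u8]) -> usize {
    let copy_len = dst.len().min(src.len());
    if copy_len == 0 {
        // Zero-byte sends and lease accesses that hit the end land here;
        // previously this still paid the full setup cost.
        return 0;
    }
    dst[..copy_len].copy_from_slice(&src[..copy_len]);
    copy_len
}
```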
I had gone to some lengths in the original implementation to only use
safe code, and in general it worked pretty well. However, it was
generating more panic sites (and checks) than I'd like.

This new version is shorter, clearer (IMO), and generates somewhat
faster code.

By shortening the various routines involved, this reduces the mean cost
of safe_copy during IPC syscalls by about 3%.
@cbiffle force-pushed the cbiffle/syscall-perf branch from 53fb987 to 04cd23a on December 14, 2024 00:29
@hawkw (Member) left a comment:

quick review of 2a56849

Comment on lines +1476 to +1478
task::NextTask::Specific(i) => &tasks[i],
task::NextTask::Other => task::select(idx, tasks),
task::NextTask::Same => idx,
task::NextTask::Same => &tasks[idx],
Member:

It occurs to me that, because the length of the task table is always HUBRIS_TASK_COUNT and task indices are always in bounds, we could probably shave off a few instructions (and an unreachable panic site) by using get_unchecked when indexing the task table here and elsewhere in the scheduler. I'm not sure whether this optimization is worth the unsafe code, but it's worth thinking about...
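A hypothetical illustration of the get_unchecked variant (written generically, since the kernel's Task type isn't shown here):

```rust
fn get_by_index<T>(table: &[T], i: usize) -> &T {
    // SAFETY: relies on the caller's invariant that `i` came from the task
    // table itself, whose length is the build-time HUBRIS_TASK_COUNT.
    unsafe { table.get_unchecked(i) }
}
```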

@aapoalas (Contributor) commented Dec 15, 2024:

Huh, I hadn't realized that the number of tasks is actually known at the kernel build time as well.

Is there a reason why that knowledge isn't more widely used in the kernel, actually? Specifically, why don't all these methods work with &(mut) [Task; HUBRIS_TASK_COUNT] references? I just tested this locally, and the changes needed to do that are quite minimal indeed. Would that cause, e.g., problems with debugging, compatibility, or code size?

};

if next == idx {
if next as *const _ == &tasks[idx] as *const _ {
Member:

isn't it better to do this comparison by index rather than by address, because writing &tasks[idx] will do a bounds check when accessing the task table, which could be avoided if we just compared the integers? OTOH then we are using an additional register or stack slot to store both the index and the address... 🤷‍♀️
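The two comparison styles being weighed, as standalone illustrative functions:

```rust
fn same_by_index(next_idx: usize, idx: usize) -> bool {
    next_idx == idx // pure integer compare; no table access at all
}

fn same_by_address<T>(next: &T, tasks: &[T], idx: usize) -> bool {
    // Re-indexes the table (bounds check included) and compares addresses;
    // core::ptr::eq compares pointers without reading the values.
    core::ptr::eq(next, &tasks[idx])
}
```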

Comment on lines 1688 to +1694
let next = match task::force_fault(tasks, idx, fault) {
task::NextTask::Specific(i) => i,
task::NextTask::Specific(i) => &tasks[i],
task::NextTask::Other => task::select(idx, tasks),
task::NextTask::Same => idx,
task::NextTask::Same => &tasks[idx],
};

if next == idx {
if next as *const _ == &tasks[idx] as *const _ {
Member:

same comments as above

if pos >= tasks.len() {
pos = 0;
}
let t = &tasks[pos];
Member:

similarly, there's probably a few instructions and a panic site that could be shaved off here by using get_unchecked, given that the if immediately above this line ensures that pos is always in bounds. On the other hand, LLVM might be smart enough to figure that out here --- the code that ensures it's in bounds is straightforward enough that I wouldn't be shocked if the compiler could figure it out...

Contributor:

I checked on Godbolt and the panics are indeed optimized out by the compiler (with the correct target in use, thumbv6m-none-eabi, and opt-level=2/s/z).

@aapoalas (Contributor) left a comment:

🥜 checking in; really cool and interesting stuff! I left a few comments, hopefully of some potential use to you <3


};

if next == idx {
if next as *const _ == &tasks[idx] as *const _ {
@aapoalas (Contributor) commented Dec 15, 2024:

thought: Looking at the code, force_fault never returns NextTask::Same and can only return Specific(0). select could theoretically return the same index but not after force_fault since the force-faulted task isn't healthy.

So I guess this panic could only be reached if it's the supervisor that is being faulted and it is blocked in receive. Is that possible? I don't know, but I assume it's at least a thing that shouldn't happen, since this panic is here :) Anyway, I wonder if it would be worth making up a special type for force_fault to return instead of NextTask, one that can just be converted into NextTask with into().

I guess it might not make much difference though: The Specific(0) path would still want to check for idx != 0 to make sure we've not just faulted the supervisor and then asked it to wake, and the Other path would still want to check that next != idx just to be sure that task::select() doesn't regress and start returning faulted tasks. Theoretically, though, it seems like there might be a good bit of unnecessary guarding going on here.


let mut choice = None;
for i in search_order {
if !pred(&tasks[i]) {
) -> Option<(usize, &Task)> {
@aapoalas (Contributor) commented Dec 15, 2024:

thought: It might be useful to make force_fault to also return a (usize, &Task) pair: It would extend the lifetime of the &mut [Task] borrow but that seems like it may not cause borrow-checker problems, based on a quick look at some usage sites. When it does pose an issue, the &Task can of course be dropped to release the &mut [Task] borrow for reuse.

// On ARMv6-M, there is no CLZ instruction either. This winds up
// generating decent intrinsic code for `leading_zeros` so we'll live
// with it.
let l2size = 30 - size.leading_zeros();
Contributor:

thought: Since the code is now build-time, I assume you could use trailing_zeros.
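For a power-of-two `size`, both forms yield log2(size) - 1, and since the computation now happens in a const context, the ARMv6-M lack of CLZ no longer matters either way:

```rust
const fn l2size_clz(size: u32) -> u32 {
    30 - size.leading_zeros()
}

const fn l2size_ctz(size: u32) -> u32 {
    size.trailing_zeros() - 1
}

// Example: a 1 KiB (1024-byte) region gives 9 from both.
```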

@@ -350,11 +350,17 @@ pub fn reinitialize(task: &mut task::Task) {
task.save_mut().exc_return = EXC_RETURN_CONST;
}

/// PMSAv6/7-style precomputed region data.
///
/// This struct is `repr(C)` to preserve the order of its fields, which happens
Contributor:

suggestion: If I'm not badly mistaken here, it seems like only the RegionDescExt data of the region table is relevant for applying memory protection. And that data of course goes unused by all other users of the region table, since it's only being added now.

It might thus be a good opportunity to split the region table into two parts, something like regions: [&'static RegionDesc; REGIONS_PER_TASK] and regions_ext: [RegionDescExt; REGIONS_PER_TASK]. This would help with cache efficiency (assuming that the RegionDescExt part can be brought into the TaskDesc struct itself, or perhaps put behind a &'static [RegionDescExt; REGIONS_PER_TASK] to avoid increasing the TaskDesc size too much).
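A rough sketch of the split being suggested (struct names follow the comment; the field contents are placeholders, and REGIONS_PER_TASK = 8 matches the table size mentioned earlier in this PR):

```rust
const REGIONS_PER_TASK: usize = 8;

struct RegionDesc { base: u32, size: u32, attributes: u32 }
struct RegionDescExt { rbar: u32, rasr: u32 } // only the MPU-load path reads this

struct TaskDesc {
    regions: [&'static RegionDesc; REGIONS_PER_TASK],
    // Separate dense array so applying memory protection doesn't drag each
    // full RegionDesc through the cache.
    regions_ext: &'static [RegionDescExt; REGIONS_PER_TASK],
}
```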

}

#[cfg(any(armv6m, armv7m))]
pub const fn compute_region_extension_data(
Contributor:

praise: Computing the MPU contents at build-time seems like a pretty wild improvement to a very core piece of the kernel. Really cool stuff!
