Syscall / context switch performance improvements #1949

Merged (2 commits), Dec 17, 2024
25 changes: 11 additions & 14 deletions sys/kern/src/arch/arm_m.rs
@@ -519,7 +519,7 @@ pub fn apply_memory_protection(task: &task::Task) {
}
}

-pub fn start_first_task(tick_divisor: u32, task: &mut task::Task) -> ! {
+pub fn start_first_task(tick_divisor: u32, task: &task::Task) -> ! {
// Enable faults and set fault/exception priorities to reasonable settings.
// Our goal here is to keep the kernel non-preemptive, which means the
// kernel entry points (SVCall, PendSV, SysTick, interrupt handlers) must be
@@ -654,7 +654,7 @@ pub fn start_first_task(tick_divisor: u32, task: &mut task::Task) -> ! {
mpu.ctrl.write(ENABLE | PRIVDEFENA);
}

-CURRENT_TASK_PTR.store(task, Ordering::Relaxed);
+CURRENT_TASK_PTR.store(task as *const _ as *mut _, Ordering::Relaxed);

extern "C" {
// Exposed by the linker script.
@@ -901,9 +901,9 @@ cfg_if::cfg_if! {
/// This records a pointer that aliases `task`. As long as you don't read that
/// pointer while you have access to `task`, and as long as the `task` being
/// stored is actually in the task table, you'll be okay.
-pub unsafe fn set_current_task(task: &mut task::Task) {
-CURRENT_TASK_PTR.store(task, Ordering::Relaxed);
-crate::profiling::event_context_switch(task as *mut _ as usize);
+pub unsafe fn set_current_task(task: &task::Task) {
+CURRENT_TASK_PTR.store(task as *const _ as *mut _, Ordering::Relaxed);
+crate::profiling::event_context_switch(task as *const _ as usize);
}
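
(Aside: a minimal sketch of the pattern at play in this hunk. The `AtomicPtr` cell and the stub `Task` type below are illustrative assumptions, not code from this PR; the point is that storing a shared `&Task` into an `AtomicPtr<Task>` needs the const-to-mut pointer cast seen above, and the safety argument is that the stored pointer is never dereferenced while a `&Task` borrow is live.)

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

struct Task; // stand-in for the kernel's task type

// Hypothetical current-task cell, analogous to CURRENT_TASK_PTR.
static CURRENT: AtomicPtr<Task> = AtomicPtr::new(core::ptr::null_mut());

fn remember(task: &Task) {
    // A shared reference coerces to *const Task, but AtomicPtr stores
    // *mut Task, hence the cast. Nothing mutates through this pointer here.
    CURRENT.store(task as *const Task as *mut Task, Ordering::Relaxed);
}
```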

/// Reads the tick counter.
@@ -1079,7 +1079,6 @@ unsafe extern "C" fn pendsv_entry() {

with_task_table(|tasks| {
let next = task::select(current, tasks);
-let next = &mut tasks[next];
apply_memory_protection(next);
// Safety: next comes from the task table and we don't use it again
// until next kernel entry, so we meet set_current_task's requirements.
@@ -1474,16 +1473,15 @@ unsafe extern "C" fn handle_fault(task: *mut task::Task) {
// switch to a task to run.
with_task_table(|tasks| {
let next = match task::force_fault(tasks, idx, fault) {
-task::NextTask::Specific(i) => i,
+task::NextTask::Specific(i) => &tasks[i],
task::NextTask::Other => task::select(idx, tasks),
-task::NextTask::Same => idx,
+task::NextTask::Same => &tasks[idx],
Comment on lines +1476 to +1478
Member:

It occurs to me that, because the length of the task table is always HUBRIS_TASK_COUNT and task indices are always in bounds, we could probably shave off a few instructions (and an unreachable panic site) by using get_unchecked when indexing the task table here and elsewhere in the scheduler. I'm not sure whether this optimization is worth the unsafe code, but it's worth thinking about...
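
For concreteness, the checked/unchecked pair being weighed might look roughly like this (an illustrative sketch, not code from this PR; `get_unchecked` is the real slice API):

```rust
// Checked indexing: emits a bounds check plus an unreachable panic site.
fn pick_checked<T>(items: &[T], i: usize) -> &T {
    &items[i]
}

// The alternative under discussion: no check, no panic site.
fn pick_unchecked<T>(items: &[T], i: usize) -> &T {
    // SAFETY: the caller must guarantee i < items.len(); in the scheduler
    // this would follow from task indices always being in bounds.
    unsafe { items.get_unchecked(i) }
}
```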

Contributor @aapoalas (Dec 15, 2024):

Huh, I hadn't realized that the number of tasks is actually known at kernel build time as well.

Is there a reason why that knowledge isn't more widely used in the kernel? Specifically, why don't all these methods work with &(mut) [Task; HUBRIS_TASK_COUNT] references? I just tested this locally, and the changes needed are quite minimal. Would that cause e.g. problems with debugging, compatibility, or code size?
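
As a sketch of the signature change being asked about (hypothetical, not part of this PR; `HUBRIS_TASK_COUNT` stands in for the build-generated constant):

```rust
const HUBRIS_TASK_COUNT: usize = 4; // illustrative; really generated at build time

struct Task;

// Today: the table arrives as a slice, so its length is opaque to the type
// system even though it is fixed at build time.
fn table_len_slice(tasks: &[Task]) -> usize {
    tasks.len()
}

// The question above: bake the length into the type, so the compiler sees it
// directly instead of relying on specialization.
fn table_len_array(tasks: &[Task; HUBRIS_TASK_COUNT]) -> usize {
    tasks.len() // a compile-time constant here
}
```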

Collaborator (Author):

> we could probably shave off a few instructions (and an unreachable panic site) by using get_unchecked

That's probably true but we'd want to be very careful about where we do it.

> why don't all these methods work with &(mut) [Task; HUBRIS_TASK_COUNT] references?

There's a historical reason why we don't use fixed-length arrays for the task table, but it's not really valid anymore. (The number of tasks was not originally available to the kernel at compile time.)

In practice, however, all of these functions get specialized by the compiler to fixed-length arrays. When disassembling the kernel you see a lot of cmp r0, #6 and stuff for validation. Making the types explicit might be an interesting simplification -- relying on the compiler for stuff like this can be tricky.

I'd prefer not to pile either of those changes on this PR if possible.

Member:

Entirely reasonable on both counts, we can think more about this later.

Contributor:

Agreed: Since I already have the changes locally, I'll try my hand at figuring out the asm output at play and see if this helps LLVM or just duplicates its work.

Contributor:

Alright, I couldn't find any difference in output (unless my mind, or a different checked-out commit of this PR, was playing tricks on me).

I can still open a PR if you'd like: the full patch file is only 434 lines long, with I think 44 actual lines changed. Very small; it guarantees the optimisation, and it might have some effect somewhere, even though I was unable to find that place (I only really analysed select and safe_copy).

};

-if next == idx {
+if core::ptr::eq(next as *const _, task as *const _) {
panic!("attempt to return to Task #{idx} after fault");
}

-let next = &mut tasks[next];
apply_memory_protection(next);
// Safety: next comes from the task table and we don't use it again
// until next kernel entry, so we meet set_current_task's requirements.
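
One detail worth calling out in these two hunks: now that `next` is a `&Task` rather than an index, "same task" is checked by address identity with `core::ptr::eq`. A tiny illustration of why `==` would be the wrong tool (it compares the pointed-to values, not identity):

```rust
fn main() {
    let a = [1u8, 1u8];
    let (x, y) = (&a[0], &a[1]);
    assert!(x == y); // value equality: both elements are 1
    assert!(!core::ptr::eq(x, y)); // identity: two distinct elements
}
```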
@@ -1688,16 +1686,15 @@ unsafe extern "C" fn handle_fault(
// fault!)
with_task_table(|tasks| {
let next = match task::force_fault(tasks, idx, fault) {
-task::NextTask::Specific(i) => i,
+task::NextTask::Specific(i) => &tasks[i],
task::NextTask::Other => task::select(idx, tasks),
-task::NextTask::Same => idx,
+task::NextTask::Same => &tasks[idx],
};

-if next == idx {
+if core::ptr::eq(next as *const _, task as *const _) {
panic!("attempt to return to Task #{idx} after fault");
}

-let next = &mut tasks[next];
apply_memory_protection(next);
// Safety: this leaks a pointer aliasing next into static scope, but
// we're not going to read it back until the next kernel entry, so we
10 changes: 3 additions & 7 deletions sys/kern/src/startup.rs
@@ -82,15 +82,11 @@ pub unsafe fn start_kernel(tick_divisor: u32) -> ! {

// Great! Pick our first task. We'll act like we're scheduling after the
// last task, which will cause a scan from 0 on.
-let first_task_index =
-crate::task::select(task_table.len() - 1, task_table);
+let first_task = crate::task::select(task_table.len() - 1, task_table);

-crate::arch::apply_memory_protection(&task_table[first_task_index]);
+crate::arch::apply_memory_protection(first_task);
TASK_TABLE_IN_USE.store(false, Ordering::Release);
-crate::arch::start_first_task(
-tick_divisor,
-&mut task_table[first_task_index],
-)
+crate::arch::start_first_task(tick_divisor, first_task)
}

/// Runs `body` with a reference to the task table.
8 changes: 4 additions & 4 deletions sys/kern/src/syscalls.rs
@@ -99,14 +99,14 @@ pub unsafe extern "C" fn syscall_entry(nr: u32, task: *mut Task) {
NextTask::Specific(i) => {
// Safety: this is a valid task from the tasks table, meeting
// switch_to's requirements.
-unsafe { switch_to(&mut tasks[i]) }
+unsafe { switch_to(&tasks[i]) }
}

NextTask::Other => {
let next = task::select(idx, tasks);
// Safety: this is a valid task from the tasks table, meeting
// switch_to's requirements.
-unsafe { switch_to(&mut tasks[next]) }
+unsafe { switch_to(next) }
}
}
});
@@ -322,7 +322,7 @@ fn recv(tasks: &mut [Task], caller: usize) -> Result<NextTask, UserError> {
let mut last = caller; // keep track of scan position.

// Is anyone blocked waiting to send to us?
-while let Some(sender) = task::priority_scan(last, tasks, |t| {
+while let Some((sender, _)) = task::priority_scan(last, tasks, |t| {
t.state().is_sending_to(caller_id)
}) {
// Oh hello sender!
@@ -667,7 +667,7 @@ fn borrow_lease(
/// To avoid causing problems, ensure that `task` is a member of the task table,
/// with memory protection generated by the build system, and that your access
/// to `task` goes out of scope before next kernel entry.
-unsafe fn switch_to(task: &mut Task) {
+unsafe fn switch_to(task: &Task) {
arch::apply_memory_protection(task);
// Safety: our contract above is sufficient to ensure that this is safe.
unsafe {
42 changes: 25 additions & 17 deletions sys/kern/src/task.rs
@@ -820,47 +820,55 @@ pub fn check_task_id_against_table(
/// Selects a new task to run after `previous`. Tries to be fair, kind of.
///
/// If no tasks are runnable, the kernel panics.
-pub fn select(previous: usize, tasks: &[Task]) -> usize {
-priority_scan(previous, tasks, |t| t.is_runnable())
-.expect("no tasks runnable")
+pub fn select(previous: usize, tasks: &[Task]) -> &Task {
+match priority_scan(previous, tasks, |t| t.is_runnable()) {
+Some((_index, task)) => task,
+None => panic!(),
+}
}

/// Scans the task table to find a prioritized candidate.
///
/// Scans `tasks` for the next task, after `previous`, that satisfies `pred`. If
/// more than one task satisfies `pred`, returns the most important one. If
/// multiple tasks with the same priority satisfy `pred`, prefers the first one
-/// in order after `previous`, mod `tasks.len()`.
+/// in order after `previous`, mod `tasks.len()`. Finally, if no tasks satisfy
+/// `pred`, returns `None`.
///
/// Whew.
///
/// This is generally the right way to search a task table, and is used to
/// implement (among other bits) the scheduler.
///
-/// # Panics
-///
-/// If `previous` is not a valid index in `tasks`.
+/// On success, the return value is the task's index in the task table, and a
+/// direct reference to the task.
pub fn priority_scan(
previous: usize,
tasks: &[Task],
pred: impl Fn(&Task) -> bool,
-) -> Option<usize> {
-uassert!(previous < tasks.len());
-let search_order = (previous + 1..tasks.len()).chain(0..previous + 1);
-let mut choice = None;
-for i in search_order {
-if !pred(&tasks[i]) {
+) -> Option<(usize, &Task)> {
Contributor @aapoalas (Dec 15, 2024):

thought: It might be useful to make force_fault also return a (usize, &Task) pair: it would extend the lifetime of the &mut [Task] borrow, but that seems like it may not cause borrow-checker problems, based on a quick look at some usage sites. When it does pose an issue, the &Task can of course be dropped to release the &mut [Task] borrow for reuse.

Collaborator (Author):

From a scan of the use sites, it looks like most want a scheduling hint (NextTask) as their return value. I hesitate to add a lifetime parameter to NextTask, as it's quite commonly used and would extend a lot of borrows in ways that might prove awkward.

The intent here would be to eliminate possible bounds checks in the consumption of the NextTask value it currently returns? I'd want to double-check that there are bounds checks being generated today before adding the complexity, but it might be an improvement.
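
For illustration, the shape of the trade-off being discussed (a hypothetical sketch, not code from this PR; the three variants mirror the match arms visible in the diff above):

```rust
struct Task;

// Index-carrying hint, roughly as used today: cheap to store and pass
// around, but consuming the index may involve a bounds check.
enum NextTask {
    Same,
    Other,
    Specific(usize),
}

// Reference-carrying alternative: no re-indexing at the use site, but the
// lifetime parameter ties up the task-table borrow for as long as the hint
// lives, which is the awkwardness described above.
enum NextTaskRef<'t> {
    Same,
    Other,
    Specific(&'t Task),
}
```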

Contributor:

I was indeed thinking of eliminating bounds checks.

I'll see about checking this option out (at a lower priority), and if it proves to be an easy change then I'll check the asm output to see if there is any potential benefit there.

Contributor:

This probably also makes no difference, or is negative. The unsafe extern "C" fn handle_fault function includes a ton of handle_fault calls but seemingly does not already have a single bounds-check panic site. Trying to remove from that would obviously only cause an unsigned integer overflow panic.

+let mut pos = previous;
+let mut choice: Option<(usize, &Task)> = None;
+for _step_no in 0..tasks.len() {
+pos = pos.wrapping_add(1);
+if pos >= tasks.len() {
+pos = 0;
+}
+let t = &tasks[pos];
Member:

similarly, there's probably a few instructions and a panic site that could be shaved off here by using get_unchecked, given that the if immediately above this line ensures that pos is always in bounds. On the other hand, LLVM might be smart enough to figure that out here --- the code that ensures it's in bounds is straightforward enough that I wouldn't be shocked if the compiler could figure it out...

Contributor @aapoalas (Dec 15, 2024):

I checked with Godbolt and the panics are indeed optimized out by the compiler (with the correct target in use, thumbv6m-none-eabi, and opt-level=2/s/z).

Collaborator (Author):

Yeah, this (and basically all other cases where an access is dominated by a naive i < slice.len()) is optimized out. I'm trying to only apply unsafe in cases where I can't figure out how to get the panics removed from the safe code, and this code specifically does not produce a panic rn.

Member:

Ah, in that case, let's leave it as is! Good job, LLVM!
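
As a standalone version of the pattern under discussion, suitable for pasting into Godbolt (illustrative, not the kernel's exact code): at opt-level=2 on a thumbv6m-style target, LLVM can prove `pos < items.len()` from the reset branch, so the indexing compiles without a bounds check or panic site.

```rust
pub fn next_match(items: &[u32], previous: usize, needle: u32) -> Option<usize> {
    let mut pos = previous;
    for _ in 0..items.len() {
        pos = pos.wrapping_add(1);
        if pos >= items.len() {
            pos = 0;
        }
        // This access is dominated by the wrap-around check above, so the
        // compiler can elide the bounds check and its panic landing pad.
        if items[pos] == needle {
            return Some(pos);
        }
    }
    None
}
```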

+if !pred(t) {
continue;
}

-if let Some((_, prio)) = choice {
-if !tasks[i].priority.is_more_important_than(prio) {
+if let Some((_, best_task)) = choice {
+if !t.priority.is_more_important_than(best_task.priority) {
continue;
}
}

-choice = Some((i, tasks[i].priority));
+choice = Some((pos, t));
}

-choice.map(|(idx, _)| idx)
+choice
}

/// Puts a task into a forced fault condition.
79 changes: 58 additions & 21 deletions sys/kerncore/src/lib.rs
@@ -9,6 +9,8 @@
#![cfg_attr(not(test), no_std)]
#![forbid(clippy::wildcard_imports)]

+use core::cmp::Ordering;

/// Describes types that act as "slices" (in the very abstract sense) referenced
/// by tasks in syscalls.
///
@@ -84,6 +86,23 @@ pub trait MemoryRegion {
fn end_addr(&self) -> usize;
}

+/// Compares a memory region to an address for use in binary-searching a region
+/// table.
+///
+/// This will return `Equal` if the address falls within the region, `Greater`
+/// if the address is lower, `Less` if the address is higher. i.e. it returns
+/// the status of the region relative to the address, not vice versa.
+#[inline(always)]
+fn region_compare(region: &impl MemoryRegion, addr: usize) -> Ordering {
+if addr < region.base_addr() {
+Ordering::Greater
+} else if addr >= region.end_addr() {
+Ordering::Less
+} else {
+Ordering::Equal
+}
+}
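
To make the comparator's intended use concrete, here is a hedged sketch of looking up the region containing an address. The toy `R` type is illustrative (real region tables come from the build system); it assumes only the trait's three methods visible in this diff.

```rust
struct R {
    base: usize,
    end: usize,
}

impl MemoryRegion for R {
    fn contains(&self, addr: usize) -> bool {
        addr >= self.base && addr < self.end
    }
    fn base_addr(&self) -> usize {
        self.base
    }
    fn end_addr(&self) -> usize {
        self.end
    }
}

// Because the table is sorted by base address and regions do not overlap,
// region_compare gives binary_search_by a consistent ordering, and a hit
// means "addr falls inside table[i]".
fn find_region(table: &[R], addr: usize) -> Option<&R> {
    table
        .binary_search_by(|reg| region_compare(reg, addr))
        .ok()
        .map(|i| &table[i])
}
```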

impl<T: MemoryRegion> MemoryRegion for &T {
#[inline(always)]
fn contains(&self, addr: usize) -> bool {
@@ -159,35 +178,53 @@

// Per the function's preconditions, the region table is sorted in ascending
// order of base address, and the regions within it do not overlap. This
-// lets us use a one-pass algorithm.
+// lets us use a binary search followed by a short scan.
let mut scan_addr = slice.base_addr();
let end_addr = slice.end_addr();

-for region in table {
-if region.contains(scan_addr) {
-// Make sure it's permissible!
-if !region_ok(region) {
-// bail to the fail handling code at the end.
-break;
-}
-
-if end_addr <= region.end_addr() {
-// We've exhausted the slice in this region, we don't have
-// to continue processing.
-return true;
-}
+let Ok(index) =
+table.binary_search_by(|reg| region_compare(reg, scan_addr))
+else {
+// No region contained the start address.
+return false;
+};
+
+// Perform fast checks on the initial region. In practical testing this
+// provides a ~1% performance improvement over only using the loop below.
+let first_region = &table[index];
+if !region_ok(first_region) {
+return false;
+}
+// Advance to the end of the first region
+scan_addr = first_region.end_addr();
+if scan_addr >= end_addr {
+// That was easy
Comment on lines +198 to +201
Member:

a super tiny, unimportant copyedit nit that i feel quite silly for commenting on: all the other comments in this function are terminated by periods, but these aren't (exclamation point optional):

Suggested change:
-// Advance to the end of the first region
-scan_addr = first_region.end_addr();
-if scan_addr >= end_addr {
-// That was easy
+// Advance to the end of the first region.
+scan_addr = first_region.end_addr();
+if scan_addr >= end_addr {
+// That was easy!

Collaborator (Author):

¡but what if I want it at the front though

+return true;
+}

-// Continue scanning at the end of this region.
-scan_addr = region.end_addr();
-} else if region.base_addr() > scan_addr {
-// We've passed our target address without finding regions that
-// work!
+// Scan adjacent regions.
+for region in &table[index + 1..] {
+if !region.contains(scan_addr) {
+// We've hit a hole without finishing our scan.
+break;
+}
+// Make sure the region is permissible!
+if !region_ok(region) {
+// bail to the fail handling code at the end.
+break;
Comment on lines +209 to 214
Member:

hum, the "fail handling code at the end" is just false...so couldn't these just be return false;? I suppose it makes sense to have them not be if we anticipate probably adding additional code that runs when no valid region is found, but...I dunno. Not a big deal, as the current thing is fine.

Collaborator (Author):

I actually can't remember why I factored it this way. I could go either way.

Member @hawkw (Dec 17, 2024):

🤷‍♀️ well, it's not terribly important to me --- i just went hunting for what the fail handling code at the end might be, and then, when I saw that it was just false, kind of went, "wait, what am I missing here?" for a bit.

}

if end_addr <= region.end_addr() {
// This region contains the end of our slice! We made it!
return true;
}

// Continue scanning at the end of this region.
scan_addr = region.end_addr();
}

-// We reach this point by exhausting the region table, or finding a
-// region at a higher address than the slice.
+// We reach this point by exhausting the region table without reaching the
+// end of the slice, or hitting a hole.
false
}
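
A quick way to sanity-check the adjacency and hole behavior of this scan, reusing the toy `R` type from the sketch above. The `covers` helper is hypothetical test code mirroring the diff's algorithm, minus the `region_ok` permission check:

```rust
fn covers(table: &[R], base: usize, end: usize) -> bool {
    let Ok(index) = table.binary_search_by(|reg| region_compare(reg, base)) else {
        return false; // no region contains the start address
    };
    let mut scan_addr = table[index].end_addr();
    if scan_addr >= end {
        return true;
    }
    for region in &table[index + 1..] {
        if !region.contains(scan_addr) {
            return false; // hit a hole
        }
        if end <= region.end_addr() {
            return true;
        }
        scan_addr = region.end_addr();
    }
    false
}

#[test]
fn adjacency_and_holes() {
    // Two adjacent regions, then a hole, then a third region.
    let table = [
        R { base: 0x100, end: 0x200 },
        R { base: 0x200, end: 0x300 },
        R { base: 0x400, end: 0x500 },
    ];
    // Spanning the adjacent-region boundary is accepted...
    assert!(covers(&table, 0x180, 0x280));
    // ...but crossing the hole at 0x300 is rejected.
    assert!(!covers(&table, 0x280, 0x450));
}
```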
