This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

Implement BPF DSQ iterator #185

Merged

17 commits merged into sched_ext on Apr 23, 2024

Conversation

@htejun (Collaborator) commented Apr 21, 2024

This allows iterating tasks in a non-local DSQ in [reverse] dispatch order from any context. While iterating in ops.dispatch(), specific tasks can be consumed in any order too.

@htejun htejun requested a review from Byte-Lab April 21, 2024 17:52
@htejun htejun force-pushed the htejun/dsq-iter branch 3 times, most recently from e7f9e0c to e6a13d0 Compare April 21, 2024 20:22
@Byte-Lab (Collaborator) left a comment:

First pass -- will do a complete review tomorrow

init/init_task.c (resolved)
kernel/sched/ext.c (resolved, outdated)
@Byte-Lab (Collaborator) left a comment:

Overall looks great, this will be super useful. Left a few comments

include/linux/sched/ext.h (resolved)
INIT_LIST_HEAD(&kit->cursor.list);
RB_CLEAR_NODE(&kit->cursor.priq);
kit->cursor.flags = SCX_TASK_DSQ_CURSOR;
kit->dsq_seq = kit->dsq->seq;
@Byte-Lab (Collaborator): READ_ONCE()?

kernel/sched/ext.c (resolved)
@@ -258,7 +260,7 @@ static void update_core_sched_head_seq(struct task_struct *p)
 void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 {
 	struct cpu_ctx *cpuc;
-	u32 zero = 0;
+	u32 zero = 0, batch = dsp_batch ?: 1;
@Byte-Lab (Collaborator): Perhaps it'd be slightly clearer if we just initialized dsp_batch to 1?

kernel/sched/ext.c (resolved)
kernel/sched/ext.c (resolved, outdated)
kernel/sched/ext.c (resolved, outdated)
tools/sched_ext/scx_qmap.bpf.c (resolved)
Comment on lines 287 to 294
bpf_for(i, 0, sizeof(exp_prefix)) {
if (exp_prefix[i] == '\0')
break;
if (exp_prefix[i] != comm[i]) {
match = false;
break;
}
}
@Byte-Lab (Collaborator): bpf_strncmp() might be useful here

@htejun (Collaborator, author): That doesn't take p->comm unfortunately. :(

tools/sched_ext/scx_qmap.bpf.c (resolved, outdated)
Commit messages from the branch:

kf_mask manipulations shouldn't be reordered when seen from the CPU itself. Tell the compiler about it.

Factor out sched_ext_entity initialization into init_scx_entity(), which memsets the entity to zero at the beginning to avoid needing zero inits.
- Relocate scx_task_iter definition so that it's together with iter
  functions.

- Add find_user_dsq() instead of open-coding rhashtable_lookup_fast().

- Relocate scx_bpf_nr_cpu_ids() close to its cpumask friends.

- Reorder kfunc decls in scx/common.bpf.h to match kernel def order.

- Remove unnecessary 0 inits from init_task initializer. While init_task
  initializers have other 0 inits, it's far from being consistent and e.g.
  rb_node doesn't even have usable initializer. Let's just initialize what's
  necessary.

No functional changes.

The function takes @mask but ignores it and uses the hard-coded __SCX_KF_RQ_LOCKED instead. Fix it. The only user was calling with __SCX_KF_RQ_LOCKED, so this doesn't cause any behavior changes.

It's currently marked ops-only but can actually be called from any context. Let's make it safe.
None of the functions in scx_kfunc_set_ops_only set are ops-only.
scx_bpf_select_cpu_dfl() is select_cpu-only and the rest can be called from
anywhere. Remove the set and move the kfuncs to scx_kfunc_set_any and the
new scx_kfunc_set_select_cpu.

No functional changes intended.

dsq->nr is read without locking from scx_bpf_dsq_nr_queued(). Use WRITE_ONCE() to update it and READ_ONCE() to read it in scx_bpf_dsq_nr_queued().

We'll use the list_head for both FIFO and PRIQ. Let's rename it to list.

No functional changes intended.

priq will be updated so that tasks are added to the list in vtime order too. Thus, caching the first node won't be necessary. Let's switch to the uncached variant.

When a DSQ is used as a PRIQ, tasks were only kept on dsq->priq. This requires all other paths to check both dsq->priq and dsq->list. A task's scx.dsq_node.list is available while the task is on a PRIQ DSQ anyway, so let's always put tasks on dsq->list, whether the DSQ is used as a PRIQ or FIFO, by inserting into dsq->list in vtime order on PRIQs.

This means that dsq->list is always in dispatch order whether the DSQ is FIFO or PRIQ, and the only paths that need to worry about the DSQ type are dispatch_enqueue() and task_unlink_from_dsq().

Give p->scx.dsq_node's type the name scx_dsq_node. This will be used to implement the DSQ iterator.

This wastes 8 bytes on 64bit. Oh well.

Implement the dsq->list iteration helpers nldsq_next_task() and nldsq_for_each_task(). As the names suggest, these are to be used only with non-local DSQs. Currently, they have no functional difference from raw list iteration; the future BPF DSQ iteration addition will modify the iteration behavior.

Implement the BPF DSQ iterator:

- Can iterate non-local DSQs.

- Can be used from any context.

- Only tasks which are already queued at the time of iterator init are visited.

- Can iterate in either dispatch or reverse dispatch order. For PRIQs, vtime order is guaranteed to monotonically increase or decrease depending on the walk direction.

- scx_qmap is updated to demonstrate the usage.

- Factor out consume_local_task() and consume_remote_task().

- s/task_can_run_on_rq()/task_can_run_on_remote_rq()/ and improve UP handling.

This will be used to implement targeted task consumption while iterating DSQs.

- Unfurl the conditions for readability.
- Add the missing task_cpu_possible() test.
- Comments.

This should behave the same. This prepares for the next patch, which will add a usage example of targeted consumption.

This allows consuming a specific task while iterating a DSQ with the BPF iterator, significantly increasing the DSQ's flexibility. It has to jump through some hoops to work around BPF restrictions, which hopefully can be removed in the future. scx_qmap is updated with a silly priority-boost mechanism to demonstrate the usage.
@Byte-Lab (Collaborator) left a comment:

LGTM, looks fantastic. Left one optional comment but this looks good to go

kernel/sched/ext.c (resolved)
@htejun htejun merged commit 2ab30f7 into sched_ext Apr 23, 2024
1 check passed
@htejun htejun deleted the htejun/dsq-iter branch April 23, 2024 20:55