Comparison with Rayon should mention ParallelIterator
#5
Comments
(Oh, this is a familiar name! 👋) Oh wow, I was not aware of this at all! I will attempt to correct any unfair comparison against Rayon, but it might take some days as I have a busy schedule this week and I'd also like to understand it fully.
I'll have to admit that I haven't used Rayon extensively before; I mainly knew about its overall design (work-stealing) and not its inner details. However, I did go to the documentation and honestly attempted to pick the most efficient method. My thinking process was as follows:
And following the documentation for … Could we fix the documentation maybe? Highlight that … Also, after reading a tiny bit about …
Aaaaah, I think maybe I've drawn the wrong conclusion here. I've seen plenty of comments on forums/Reddit from people using Rayon and seeing code become much slower. I attributed this to the overhead (the first point), but maybe all of those reports are actually about thread contention.
Nice running into you, it's been a while! 👋
The docs do say that the high-level parallel constructs are "typically the most efficient", so to me there is an implication that you should use them if possible, and that you are incurring a bit more responsibility with regard to performance if you choose to use …
"Tiny" is relative. 🙂 No threads are spawned, there are usually no heap allocations (the data for both closures remains on the stack), and dynamic dispatch is avoided where possible. The overhead is not negligible in comparison to the size of the work items in the benchmark (essentially a single ADD instruction), or in comparison to the single atomic load that Spice's …
When called from a thread which is in the thread pool, … You can more or less think of …
It doesn't seem like … To understand better how it works, I made a slight tweak to the tree to make it more unbalanced. The intention is for the tree to look something like this: … Sorry if there are some off-by-one bugs.

```rust
pub fn make_unbalanced_tree(from: i64, to: i64, left_size: i64) -> Node<i64> {
    if left_size == 0 {
        return make_balanced_tree(from, to);
    }
    let value = from + left_size;
    Node {
        value,
        left: Some(Box::new(make_balanced_tree(from, value - 1))),
        right: Some(Box::new(make_unbalanced_tree(value + 1, to, left_size / 2))),
    }
}
```

When I create a … (I'd also say that this type of uneven balance seems pretty common to me: imagine a webpage which has one huge table deeply nested somewhere. Overall the DOM is balanced and nice, but then there's this one heavy subtree, which is also internally balanced nicely.) It seems that in order to use … What Rayon's … I still think …
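The thread never shows the `Node` type or the `make_balanced_tree` helper the code above calls. Here is a plausible reconstruction of both, purely as a reading aid; the actual benchmark code may differ:

```rust
// Assumed definitions: neither appears in the transcript, so this is a
// plausible reconstruction, not the original source.
pub struct Node<T> {
    pub value: T,
    pub left: Option<Box<Node<T>>>,
    pub right: Option<Box<Node<T>>>,
}

// Builds a balanced binary search tree over the inclusive range from..=to.
pub fn make_balanced_tree(from: i64, to: i64) -> Node<i64> {
    let mid = from + (to - from) / 2;
    Node {
        value: mid,
        left: (from < mid).then(|| Box::new(make_balanced_tree(from, mid - 1))),
        right: (mid < to).then(|| Box::new(make_balanced_tree(mid + 1, to))),
    }
}

// Sum every value in the tree, for sanity-checking the builder.
pub fn tree_sum(node: &Node<i64>) -> i64 {
    node.value
        + node.left.as_deref().map_or(0, tree_sum)
        + node.right.as_deref().map_or(0, tree_sum)
}
```

With these definitions, `make_unbalanced_tree` hangs a balanced subtree of `left_size` nodes off each step of a right spine whose chunks halve in size.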
You're right, …
Because your entire program probably doesn't consist of a single call to … Here's a simple modification to …:

```rust
const THRESHOLD: usize = 10_000;

fn sum_timer_inner(pool: &rayon::ThreadPool, node: &Node<i64>, iter: &mut usize) -> i64 {
    *iter += 1;
    if let (Some(left), Some(right)) = (&node.left, &node.right) {
        if *iter >= THRESHOLD {
            let (left_result, right_result) = pool.join(
                || sum_timer_inner(pool, left, &mut 0),
                || sum_timer_inner(pool, right, &mut 0),
            );
            return node.value + left_result + right_result;
        } else {
            return sum_timer_inner(pool, left, iter) + sum_timer_inner(pool, right, iter);
        }
    }
    let mut result = node.value;
    if let Some(child) = &node.left {
        result += sum_timer_inner(pool, child, iter);
    }
    if let Some(child) = &node.right {
        result += sum_timer_inner(pool, child, iter);
    }
    return result;
}

pub fn sum_timer(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    sum_timer_inner(pool, node, &mut 0)
}
```

Here are the results for 10,000,000 nodes: …

And for 1,000 nodes: …

Here, the …
Yes! I agree with everything you're saying. The intention of this benchmark is to show an apples-to-apples comparison using the same API, not to claim that it's impossible for Rayon to handle this problem. You're also proving the value proposition of Spice, though: with Spice you don't have to think about any of this or tweak any parameters. I've opened a PR to clarify the comparison, but I'm also open to other suggestions on how to phrase the README. I find it a bit hard to balance between "explaining everything" and "getting to the point", especially as different readers have different levels of understanding of the fork/join mechanism.
And a small note: is this the actual tree? In that case it's not very unbalanced, since the last parameter is the size of the balanced left child. I'm not sure that it matters, though. It does look like this depth-based granularity control would be able to handle this type of unbalanced tree. But yeah, now you have a manual parameter to tweak.
Yeah, agreed.
Whoops, sorry, that was just a silly mistake on my part (I misread the arguments to …).
In fact, I would expect it to be able to handle any shape of unbalanced binary tree just about as well as the Spice version. The …
Ah, yes, of course, you're right. Using the numbers from the benchmarks: this ensures that we're doing 10,000 of those ~7 ns operations before adding ~15 ns of overhead. That makes the overhead negligible. (I think you're missing a …)
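For concreteness, the amortization argument works out as follows, using the figures quoted in this thread (~7 ns of work per node, ~15 ns of overhead per split, one split per `THRESHOLD` nodes):

```rust
fn main() {
    // Figures assumed from the discussion above, not measured here.
    let work_per_chunk_ns = 10_000.0 * 7.0; // 70,000 ns of useful work per chunk
    let overhead_ns = 15.0; // one join's overhead per chunk
    let relative = overhead_ns / work_per_chunk_ns;
    // Overhead is roughly two hundredths of a percent of the work.
    println!("relative overhead: {:.4}%", relative * 100.0);
}
```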
Rayon has a `ParallelIterator` trait which is based around recursively splitting tasks into subtasks until the set of workers in the pool is saturated. `ParallelIterator` is fairly central to Rayon's overall design, and I would argue that the `sum_rayon` implementation in the benchmark here, which calls `join` for every single node in the tree, is really not idiomatic.

Here's an alternate version that makes use of Rayon's `iter::split` function: …

When I benchmark these two implementations on a tree with 10,000,000 nodes (on an Intel Core i5-8400 CPU with 6 cores), the version that uses `split` is significantly faster than the version that calls `join` for every node, and is never slower than the baseline: …

However, it is true that the `split` version still shows the same poor scaling behavior for a tree with only 1,000 nodes: …

This is because the task is being eagerly split up to saturate the available workers without any regard for how small it is. Of course, there are usually good ad hoc solutions to this in the context of any particular problem: in this case, you could introduce a heuristic to make `split` bottom out for tasks below a certain size, and `IndexedParallelIterator` has methods like `with_min_len`/`with_max_len` and `by_uniform_blocks`/`by_exponential_blocks`. However, it's true that Rayon doesn't offer any completely automatic solution here, and the fact that Spice does makes it genuinely different and interesting.

Essentially, I think there are two main factors responsible for the differences in the benchmarks between Rayon and Spice: first, Spice is able to avoid the overhead of dynamic task dispatch by scheduling contiguous chunks of tasks for single workers; and second, Spice is able to preempt those low-overhead contiguous chunks of tasks after they're already running in order to split them adaptively. The first thing isn't actually unique to Spice, and is in fact the purpose of Rayon's parallel iterator abstraction. Given that Rayon's docs push you pretty heavily to make use of parallel iterators, ignoring them entirely makes that aspect of the comparison feel somewhat misleading to me. However, the second thing is something that Rayon genuinely lacks the machinery for, which means that parallel iterators can sometimes make suboptimal splitting decisions, and so I think that aspect of the comparison is valid.