Comparison to Weave #7
I'm not aware of Weave. As far as I can see, one difference is that Weave is not for C, while Lace is? You can easily compile Lace, run the Fibonacci benchmark, and compare it with Weave if you want to compare the performance. I don't have time for this right now.
Weave does compile to C. Should you find some time to write down how the internals of Lace work and what the trade-offs are, please let me know. Thanks!
Any insights into how Lace works?
I have written two papers about it back in 2014 when it was in active development, and there are examples in the benchmarks folder. You initialize Lace with a start call and then define and run tasks through the task macros; a sketch follows below. There are a bunch more advanced features like barriers and interrupts; they are used in my Sylvan project. You can also produce statistics on the performance of the load balancer, and you can temporarily stop the Lace workers and restart them later.
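For a concrete feel of the API, here is a minimal sketch of a Lace program in the style of the examples in the benchmarks folder. This is a reconstruction, not a verbatim example from the repository, and the exact macro and function names may differ between Lace versions:

```c
#include <stdio.h>
#include <lace.h>

// Define a Lace task with one argument: TASK_1(return type, name, arg type, arg name).
TASK_1(int, fib, int, n)
{
    if (n < 2) return n;
    SPAWN(fib, n - 1);        // push one child task onto this worker's deque
    int b = CALL(fib, n - 2); // execute the other child directly
    int a = SYNC(fib);        // wait for (or execute ourselves) the spawned child
    return a + b;
}

int main()
{
    lace_start(0, 0);          // start the workers (0 = auto-detect core count, default deque size)
    int result = RUN(fib, 38); // enter the Lace runtime from non-task code
    printf("fib(38) = %d\n", result);
    lace_stop();               // shut the workers down again
    return 0;
}
```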
Internally, as described in my early PhD work, Lace uses a special queue that is optimized for this particular task; it's the topic of one of the papers. The infrastructure is very similar to Wool, a project by Karl-Filip Faxén from 2010, which in turn was inspired by Cilk, which used to be part of the GCC compiler for a while. In my opinion, Lace and Wool offer the best performance for fine-grained task-based parallelism (fork-join semantics), but of course the downside is the abuse of C macros...
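To give a rough idea of why such a queue is cheap for its owner, here is a small, deliberately simplified C sketch of the split-deque concept described in the papers. It is an illustration only, not Lace's actual data structure; in particular, all thief-side code and the synchronization around the split point are omitted:

```c
/* Illustrative sketch only (not Lace's actual code) of the split-deque idea
 * from the Lace/Wool papers: each worker owns one deque, the owner pushes
 * and pops at the tail without atomics, thieves take tasks from the head,
 * and only moving the "split point" between the shared and private parts
 * requires synchronization. Thief-side code is elided here. */
#include <stdbool.h>
#include <stdio.h>

#define DQ_SIZE 1024

typedef struct { int arg; } Task;

typedef struct {
    Task tasks[DQ_SIZE];
    unsigned head;   /* thieves steal here (shared part: [head, split)) */
    unsigned split;  /* boundary between the shared and private parts */
    unsigned tail;   /* owner pushes/pops here (private part: [split, tail)) */
} Deque;

/* Owner fast path: plain loads and stores, no synchronization needed
 * as long as the private part [split, tail) is non-empty. */
static void push(Deque *dq, Task t) { dq->tasks[dq->tail++] = t; }

static bool pop(Deque *dq, Task *t) {
    if (dq->tail == dq->split) return false; /* private part empty: would need to sync with thieves */
    *t = dq->tasks[--dq->tail];
    return true;
}

int main(void) {
    Deque dq = { .head = 0, .split = 0, .tail = 0 };
    push(&dq, (Task){ .arg = 41 });
    push(&dq, (Task){ .arg = 42 });
    Task t;
    while (pop(&dq, &t)) printf("popped task with arg %d\n", t.arg);
    return 0;
}
```

The point of the split is that the owner's push and pop stay ordinary memory operations; a fence is only needed when the private part runs empty and the split point has to move.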
By the way, you could probably easily check the benchmarks if you have experience with Weave. I see at https://github.com/mratsim/weave/tree/master/benchmarks that several of these come from Cilk, so they are probably quite similar to the benchmarks implemented in Lace. Simply use the corresponding programs in the Lace benchmarks folder.
It takes some effort to install this with an outdated setup. With Lace, I get 4 seconds, versus 41 seconds for Weave. The other benchmark, fib, is disappointing: Weave takes 10.852 seconds to compute fib 38, while Lace does it in 0.24 seconds on my machine. Maybe this is not how I'm supposed to run Weave, but I have no clue; I've never used Nim before.
Wow, you are quick! Thanks a lot for the description of Lace. Admittedly I hadn't read the paper, so your overview made me want to read it! Btw. yes. If I have any more questions, I will certainly come back 😉. Btw. thanks for mentioning Sylvan, I might have a use case for it in a year or two.
Using the same setup, I see very similar performance for NQueens: Lace takes 23.01 seconds and Weave takes 24.37 seconds. For Fibonacci, which is very sensitive to overhead, Lace takes 188 ms and Weave takes 2663 ms. And in case you wonder whether I cheated with Fibonacci: it is the plain recursive benchmark.
Very interesting! That sparked my interest in looking even at the assembly of both "contestants". The "variation" in performance seen for Weave is something I would like to investigate a bit. @mratsim, would this be something to investigate (maybe it has to do with recent changes in Nim)? Btw. @trolando, do you use the Clang or the GCC compiler for both "contestants" on your machine?
Unfortunately, improving multithreading is not my priority at the moment, but I did review the code recently. Here are Lace and Weave on my hardware (i9-9980XE, 18 cores): 2x to 2.5x faster on nqueens but 2.5x to 10x (!) slower on fibonacci. Weave has two configurations, "eager allocated futures" and "lazy allocated futures", with very different overheads (see the number of page faults).

The thing I'm not too sure about is that fib 40 used to take 180 to 220 ms with lazy futures (see mratsim/weave#13) and 350 to 400 ms with eager futures (see mratsim/weave#112) on the same machine. I retested fibril (https://github.com/chaoran/fibril) and performance didn't change there, so I'm unsure what's up. Fibril is the lowest-overhead runtime I benchmarked on fibonacci when developing Weave, but it took an approach that requires platform-specific assembly, as it is continuation-stealing/parent-stealing based.
Weave is based on the PhD of @aprell (C implementation: https://github.com/aprell/tasking-2.0, thesis: https://epub.uni-bayreuth.de/2990/) and also on a collection of research notes here: https://github.com/numforge/laser/blob/e23b5d6/research/runtime_threads_tasks_allocation_NUMA.md

The main internal characteristic is that it is based on fully private deques, so steal requests are actually work-sharing requests (see the sketch after this comment). Also, the theft policy, steal-one vs steal-half, is adaptive, and for-loops are partitioned lazily instead of eagerly, so they auto-adapt to the function being called and the resources available.

In terms of usage, it seems to me that Lace requires structured parallelism, i.e. a spawn has to be synced before function exit (async/finish from Habanero; see also structured concurrency in I/O runtimes: https://en.wikipedia.org/wiki/Structured_concurrency). In Weave, this is the case for the lazy futures (because of alloca), but in the default case futures can escape their allocating scope. This allows quite flexible usage in a complex codebase, at the price of memory-allocation overhead. Also, Lace seems to keep idle threads spinning. On top of that, Weave made a number of changes compared to the C code.
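To make the contrast with Lace's shared deques concrete, here is a deliberately tiny, sequential C sketch (an illustration only, not Weave's Nim code) of the private-deque model: thieves never touch a victim's deque; instead they send a steal request, and the victim answers by sharing work, choosing between handing over one task or half of its queue:

```c
/* Heavily simplified illustration (not Weave's actual code) of the
 * private-deque + work-sharing model described above: the deque is fully
 * private to its owner, so an idle worker sends a steal request and the
 * victim, while working, polls for such requests and ships tasks back,
 * adaptively choosing between steal-one and steal-half. */
#include <stdio.h>

#define MAX_TASKS 64

typedef struct { int id; } Task;

typedef struct {
    Task tasks[MAX_TASKS];
    int count;          /* fully private: only the owner ever touches this */
} PrivateDeque;

typedef enum { STEAL_ONE, STEAL_HALF } StealPolicy;

/* The victim's response to a steal request: move tasks to the requester.
 * In a real runtime the request and the tasks travel over channels;
 * here both deques are simply visible to this function. */
static int share_work(PrivateDeque *victim, PrivateDeque *thief, StealPolicy policy) {
    int n = (policy == STEAL_HALF) ? victim->count / 2
                                   : (victim->count > 0 ? 1 : 0);
    for (int i = 0; i < n; i++)
        thief->tasks[thief->count++] = victim->tasks[--victim->count];
    return n;
}

int main(void) {
    PrivateDeque victim = { .count = 0 }, thief = { .count = 0 };
    for (int i = 0; i < 8; i++) victim.tasks[victim.count++] = (Task){ .id = i };

    /* An idle worker sends a steal-half request; the victim shares work. */
    int sent = share_work(&victim, &thief, STEAL_HALF);
    printf("victim shared %d tasks; victim has %d left, thief has %d\n",
           sent, victim.count, thief.count);
    return 0;
}
```

The adaptive part in the real runtime is the choice between STEAL_ONE and STEAL_HALF, made based on observed demand rather than fixed up front as in this sketch.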
Potentially related: #9 (comment)
This looks like a simple yet quite efficient implementation of work stealing. I wonder how it compares to the somewhat more modern work-stealing framework Weave.
Before I spend hours reading and understanding the inner details of Lace, could anyone make a rough comparison (of the approaches taken as well as real measured performance)? Weave has a quite readable primer and documentation, so this comparison is more about the Lace side 😉.