Replies: 8 comments 5 replies
-
I can only speak to the shuffling of data in a multithreaded context: we are working on using loony with a memory-safety implementation that ensures the temporal validity of the memory. However, that implementation will take some time to build, and then benchmark against alternatives, before we incorporate it.
-
https://github.com/nim-works/cps/blob/master/stash/performance.nim
Newer compilers are starting to demonstrate significant gains over closure iterators; I think you start to see notable numbers at around LLVM/GCC 9 or so, and more recent releases are faster still.
-
We lost our speed advantage, looks like. 😆 Or maybe I'm running it wrong somehow. But this is probably due to the recent workaround for the compiler fix for type conversions. I guess we really need to start monitoring performance.
-
Thanks for the numbers. I only have old CPUs (Core 2 Duo) available right now, so my results can't be "forward looking". Are you using a recent CPU for your measurements? New CPUs have enormous caches etc., so this might make a difference (not that I like it, I'm just pointing it out).
-
I'm on an i7-8700K, which is pretty old now; I think I bought my first one around 5 years ago. But it wasn't that long ago that cps was 15% faster in that iterator benchmark. That said, I think there may also have been a performance regression in Nim's iterators at one point. Anyway, newer compilers are getting better at optimizing our CPS output, to the tune of roughly 20% gains.
-
With #244 cps is much slower:
-
Wow, that's interesting. I wouldn't have guessed it could be this much slower than a "CPS imitation using closures with iterators".
-
Here are some current numbers.
My old nim from devel branch, 2024-01-18 nightlies:
Nimskull 0.1.0-dev.21210:
These are both best-out-of-three.
-
I wonder how this Nim CPS implementation performs in tight loops and similar "critical" contexts. We can discuss theoretical performance, but I'm really interested in real measurements compared to non-CPS compiled binaries.
Has anyone already run such tight-loop benchmarks? What were the results?
And if anyone has benchmarks of multicore apps shuffling data back and forth between cores, that'd be even better!
Note, I'm not asking for rigorous benchmarking. I just want a glimpse of how well current compilers (LLVM, GCC) can optimize the code generated by these CPS macros, what the practical impact is, and whether it "wildly varies" or is a fairly stable, near-constant improvement/worsening independent of the scenario.