Replies: 8 comments 5 replies
-
I can only speak to the shuffling of data in a multithreaded context: we are working on using loony with a memory-safety implementation that ensures the temporal validity of the memory. However, that implementation will take some time to build, and then benchmark against alternatives, before we incorporate it.
-
https://github.com/nim-works/cps/blob/master/stash/performance.nim
Newer compilers are starting to demonstrate significant gains over closure iterators; I think you start to see notable numbers at around LLVM/GCC 9 or so, and more recent releases are faster still.
-
We lost our speed advantage, looks like. 😆 Or maybe I'm running it wrong somehow. But this is probably due to the recent workaround for the compiler fix for type conversions. I guess we really need to start monitoring performance.
-
Thanks for the numbers. I only have old CPUs (Core 2 Duo) available right now, so my results can't be "forward looking". Are you using a recent CPU for your measurements? New CPUs have enormous caches etc., so this might make a difference (not that I like it, I'm just pointing it out).
-
I'm on an i7-8700K, which is pretty old now; I think I bought my first one around 5 years ago. But it wasn't that long ago that cps was 15% faster in that iterator benchmark. That said, I think there may also have been a performance regression in Nim's iterators at one point. Anyway, newer compilers are getting better at optimizing our CPS output, to the tune of roughly 20% gains.
-
With #244 cps is much slower:
-
Wow, that's interesting. I wouldn't have guessed it could be this much slower than a "CPS imitation using closures with iterators".
-
Here are some current numbers.
My old nim from devel branch, 2024-01-18 nightlies:
Nimskull 0.1.0-dev.21210:
These are both best-out-of-three.
-
I wonder how this Nim CPS implementation performs in tight loops and similar "critical" contexts. We can discuss theoretical performance, but I'm really interested in real measurements compared to non-CPS compiled binaries.
Has anyone already run such tight-loop benchmarks? What were the results?
And if anyone has benchmarks of multicore apps shuffling data back and forth between cores, that'd be even better!
Note, I'm not asking for rigorous benchmarking. I just want a glimpse of how well current compilers (LLVM, GCC) can optimize the code generated by these CPS macros, what the practical impact is, and whether it "wildly varies" or is a fairly stable, near-constant improvement/worsening independent of the scenario.