C++ version of the benchmark #80
@rochus-keller would you want this port to be integrated? A student is actually currently also working on a C++ port of the benchmarks. One of the questions we wanted to look at is what reasonable memory management strategies are. Thanks!
It's up to you; I use it as is; for integration with your harness additional work would be necessary.
So I guess it will be some recent C++ standard version, which is useful in any case; my port is in C++98, since everything from C++11 onwards (with some exceptions in C++17 and C++20) is essentially syntactic sugar and implementable in C++98, which can therefore be seen as the core language. I look forward to seeing the measured performance differences of the two C++ benchmark implementations.
I used all sorts of manual memory management strategies: plain new/delete/delete[], reference counting, and helper arrays where I just put the instances. I also changed the API in some places to avoid Vector copies, and I kept dynamic array allocations (instead of stack allocation) in some places (when not too expensive) to remain similar to the Java implementation from which I derived mine. I also allowed a Vector with no initial storage allocation to avoid expensive copy construction when using Vector by value. I didn't use standard library features for memory management because of your guidelines and because they give no advantage (just memory overhead).
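A minimal sketch of the lazy-allocation idea described above, assuming a hand-rolled C++98 Vector; the class and method names here are illustrative, not the port's actual API. The buffer stays null until the first append, so copy-constructing an empty Vector costs almost nothing:

```cpp
#include <cstddef>

// Illustrative sketch, not the actual port's Vector: no storage is
// allocated until the first append, so passing an empty Vector by
// value is cheap.
template <typename T>
class Vector {
    T* buf;              // null until the first append
    std::size_t len;
    std::size_t cap;
public:
    Vector() : buf(0), len(0), cap(0) {}
    Vector(const Vector& other) : buf(0), len(0), cap(0) {
        // Copying an empty Vector allocates nothing at all.
        for (std::size_t i = 0; i < other.len; ++i) append(other.buf[i]);
    }
    ~Vector() { delete[] buf; }
    void append(const T& v) {
        if (len == cap) {                       // grow lazily from empty
            std::size_t ncap = (cap == 0) ? 4 : cap * 2;
            T* nbuf = new T[ncap];
            for (std::size_t i = 0; i < len; ++i) nbuf[i] = buf[i];
            delete[] buf;
            buf = nbuf;
            cap = ncap;
        }
        buf[len++] = v;
    }
    std::size_t size() const { return len; }
    const T& at(std::size_t i) const { return buf[i]; }
private:
    Vector& operator=(const Vector&);           // not needed for the sketch
};
```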
Right; that's idiomatic in C++.
As described above. EDIT: Dictionary, JsonValue, List/Element and Richards/RBObject use reference counting, mostly because it was straightforward to implement; I assume one could also just call delete in the destructor if need be, but I did some measurements and found that the overhead of reference counting compared to no delete at all is only about 13%.
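As an illustration of the intrusive reference-counting approach mentioned here, a hypothetical C++98-style sketch; the names Counted and Ref are illustrative and not taken from the actual port. The count lives inside the object itself, so no extra heap allocation is needed:

```cpp
// Hypothetical sketch of intrusive reference counting in C++98 style,
// in the spirit of what RBObject/JsonValue might use.
class Counted {
    int refs;
public:
    Counted() : refs(0) {}
    virtual ~Counted() {}
    void addRef() { ++refs; }
    void release() { if (--refs == 0) delete this; }
    int refCount() const { return refs; }
};

// A minimal owning handle; real code would also implement assignment.
template <typename T>
class Ref {
    T* p;
public:
    explicit Ref(T* q) : p(q) { if (p) p->addRef(); }
    Ref(const Ref& o) : p(o.p) { if (p) p->addRef(); }
    ~Ref() { if (p) p->release(); }
    T* operator->() const { return p; }
    T* get() const { return p; }
private:
    Ref& operator=(const Ref&); // omitted in this sketch
};
```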
I finally managed to run the numbers. The RK ones are yours; HJ is my student's code. Everything is run with -O2 and the same settings across the languages. I also changed your harness to match the standard harness.
That's interesting, thanks. I'm pretty astonished how fast Temurin is. And Node.js 20.3 seems to be a bit slower than 12.16. Which GCC and Clang versions did you use? Is the C++ code of HJ accessible somewhere? I'm actually considering a C version of the benchmark; or do you already have a student working on it? Did you have a look at my code otherwise? Do you think it meets the benchmark guidelines? Should I rather replace the dynamically allocated arrays with fixed-size stack arrays, e.g. in Queens::queens (which would increase the difference to the Java version)?
Meanwhile I found HJ's code: https://github.com/smarr/are-we-fast-yet/blob/wip/cpp/benchmarks/C%2B%2BHJ |
Ehm, it's a bit of a mess. I should have mentioned that... it's a bit quick and dirty... Each benchmark is run on exactly one machine, but yeah, the compiler versions mix between benchmarks: g++ 9.4.0 and 11.4.0. The version of the code you found is my import of his code, with small modifications. And indeed, he wanted to go with a naive version first that consistently uses
No, I haven't really yet. Generally, the things we discussed were roughly along the following lines:
Ok, I see. I use GCC 4.8 on my test machine for everything (and everything I reported in https://github.com/rochus-keller/Oberon/tree/master/testcases/Are-we-fast-yet also ran on the same machine). I now run HJ's version on my machine as a C++11 application. To make this work I replaced all std::any by int (little influence on performance). I also had to fix Storage, which was actually not called in the Harness and rendered a wrong result (I just copied my code). Here is what I get:

This seems to correspond to your measurements. As it seems, HJ's code spends most of its time on allocation/deallocation. The C++ standard allocator is known to be rather slow. In the case of shared_ptr it's even worse, since shared_ptr does an extra heap allocation for the ref counts. I think no class is used by value; everything seems to be heap allocated. I also think that there are unnecessary vector copies where the compiler doesn't automatically infer move operations, but I have to check this in the debugger.

In my version, e.g. Havlak and DeltaBlue spend most of their time in allocation/deallocation as well, since these algorithms depend on polymorphic objects; I could replace the allocation scheme with a special allocator or memory pooling, but this would no longer be idiomatic and would contradict e.g. the Java implementation; so I think this is the cost of C++. I also found that Richards spends nearly half of its time in dynamic_cast, which is rather surprising, but yet again the cost of C++; I could replace dynamic_cast with static_cast to gain nearly a factor of two, but that would be cheating.
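The extra heap allocation for shared_ptr's reference counts can be made visible by replacing the global operator new with a counting version. This is a C++11 sketch under the assumption of a mainstream standard library (libstdc++/libc++): shared_ptr<T>(new T) performs two allocations (object plus control block), while std::make_shared fuses them into one.

```cpp
#include <memory>
#include <new>
#include <cstdlib>
#include <cstddef>

// Count every heap allocation by replacing the global operator new.
static int g_allocs = 0;

void* operator new(std::size_t n) {
    ++g_allocs;
    void* p = std::malloc(n);
    if (!p) throw std::bad_alloc();
    return p;
}
void operator delete(void* p) noexcept { std::free(p); }

struct Payload { int x; Payload() : x(0) {} };

// Returns how many heap allocations one shared_ptr construction costs.
int countAllocs(bool fused) {
    int before = g_allocs;
    if (fused) {
        // One allocation: object and control block are fused.
        std::shared_ptr<Payload> p = std::make_shared<Payload>();
    } else {
        // Two allocations: the Payload itself, then the control block.
        std::shared_ptr<Payload> p(new Payload());
    }
    return g_allocs - before;
}
```

Note that neither variant helps a C++98 port, which is one reason an intrusive count inside the object can be attractive there.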
One thing I missed before:
No. I think if another student would want to go with a low-level language, I would probably try to get them to use OCaml or Rust or so.
Yes, in the full report, results for the individual benchmarks look similar for me, too.
Yeah... it was a strategy to get started, and then things took longer to get correct than hoped... will have to see whether we get to the point of optimizing further. Same with the vector copies. There is a long list of things I noticed from mere code inspection. Though, I'd rather he write things up and have text to submit at the end of the project, which is very soon...
Hm. What's your reasoning here? If you know by construction that it's OK and correct?
Ah, and yes, the Storage benchmark is known to be broken. He was still working on that one.
First of all, dynamic_cast corresponds to the casts done in Java; if a cast doesn't work in Java, an exception is thrown; here we would have to check whether dynamic_cast returns nullptr and throw an exception ourselves if so. Why not static_cast? Because the methods get an RBObject* pointer but execute methods of a subclass, and we cannot be sure that an instance of this subclass is actually passed to the method. A better approach would be to declare the method with the subclass instead of RBObject* (if possible), or to add the required interface for each use case to RBObject, so we could call polymorphically without an explicit cast.
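The checked-cast idea described above could look like the following sketch, mirroring Java's ClassCastException semantics; RBObject and Packet are illustrative stand-ins for the Richards classes, and checkedCast is a hypothetical helper, not part of the actual port:

```cpp
#include <stdexcept>

// Illustrative stand-ins for the Richards class hierarchy.
struct RBObject { virtual ~RBObject() {} };
struct Packet : RBObject { int id; Packet() : id(0) {} };
struct TaskControlBlock : RBObject {};

// Checked downcast: like a Java cast, it throws if the dynamic type
// does not match instead of silently producing an invalid pointer
// (as an unchecked static_cast would).
template <typename To>
To* checkedCast(RBObject* obj) {
    To* p = dynamic_cast<To*>(obj);
    if (p == 0)
        throw std::runtime_error("bad cast"); // Java: ClassCastException
    return p;
}
```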
Did vector<void> actually compile with the student's compiler version?
Can I access the full report somewhere? I would like to know the exact factor between my implementation (GCC version) and Temurin. The JVM is known to have a very sophisticated allocator/collector, which might be the main reason why the C++ version is only 40% (?) faster, but this comes at the cost of much higher memory consumption.
Update concerning benchmark guideline conformance. You said:
I made a branch of my C++98 version with some modifications you now explicitly allow. Here is the commit: rochus-keller/Are-we-fast-yet@63ab1e4. The effect on overall performance is negligible, not actually worth the effort. Here are the results: Permute got ~20% faster and Queens ~50%, but Bounce and Json became slower, the former by ~20%.
Closing with #84 being merged. Thanks for the discussions.
In case anyone is interested, here is a C++98 implementation of the benchmark: https://github.com/rochus-keller/Are-we-fast-yet/tree/main/Cpp.
I have tried to apply the guidelines mutatis mutandis, making a compromise between how a C++98 developer would have implemented it and staying faithful to the existing Java code, as long as that doesn't waste too much performance. I also tried to avoid unnecessary rewriting. Basically we can discuss all decisions; we just have to commit to something.
There is also an existing Oberon+ version and currently I'm planning for a FreePascal version.