
cache line pad mutex #4

Open
jnordwick opened this issue Aug 13, 2024 · 5 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@jnordwick

jnordwick commented Aug 13, 2024

I'm not sure this will particularly matter, but the ThreadPool mutex is going to be sharing a cache line with other data. Since Zig reorders fields in regular structs, it's unclear what it is sharing a cache line with: it could be something that matters, or it could be irrelevant.
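For example, something like this (a made-up stand-in, not the actual ThreadPool) is enough to see where the compiler actually put things:

const std = @import("std");

// Hypothetical pool-like struct; Zig is free to reorder these fields, so the
// mutex's neighbors are whatever the compiler picks.
const PoolLike = struct {
    mutex: std.Thread.Mutex = .{},
    pending: usize = 0,
    shutdown: bool = false,
};

pub fn main() void {
    std.debug.print("mutex at {}, pending at {}, struct size {}, cache line {}\n", .{
        @offsetOf(PoolLike, "mutex"),
        @offsetOf(PoolLike, "pending"),
        @sizeOf(PoolLike),
        std.atomic.cache_line,
    });
}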

@notcancername

notcancername commented Aug 14, 2024

Consider std.atomic.cache_line for this purpose.
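Something along these lines (a rough sketch with made-up names, not the project's actual code): align the mutex to the cache line and pad out the rest of it so nothing else can land there.

const std = @import("std");

// Sketch only: give the mutex its own cache line. The alignment keeps it at a
// line boundary; the trailing pad keeps other fields off that line.
const PaddedMutex = struct {
    inner: std.Thread.Mutex align(std.atomic.cache_line) = .{},
    _pad: [std.atomic.cache_line - @sizeOf(std.Thread.Mutex)]u8 = undefined,
};

pub fn main() void {
    var m = PaddedMutex{};
    m.inner.lock();
    defer m.inner.unlock();
    std.debug.print("size {}, align {}\n", .{ @sizeOf(PaddedMutex), @alignOf(PaddedMutex) });
}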

@jnordwick
Author

jnordwick commented Aug 15, 2024

Consider std.atomic.cache_line for this purpose.

Those values seem really dated. They are taken from this list that Rust and Go also use. The list shows x86_64 destructive interference as 128B, i.e. two cache lines, but that is because of the old L2 spatial prefetcher, and there might be a misunderstanding of how it works.

I never saw good data. C++ gives 64 for destructive interference (false sharing) and 128 for constructive interference (true sharing), and I think those are more correct.

I know those values go back a while, and I've never seen compelling evidence for them, just some random assertions in comments. The L2 spatial prefetcher will try to complete pairs of lines, but it won't evict L1 data that is modified.
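If the distinction ever matters here, the C++ pair could be mirrored with something like this (names and values are illustrative only; Zig's std just has the single cache_line constant):

const std = @import("std");

// Hypothetical constants mirroring the C++ pair mentioned above; not in std.
const hardware_destructive_interference_size = 64; // space independently written hot data at least this far apart
const hardware_constructive_interference_size = 128; // keep data meant to be read together within this span

pub fn main() void {
    std.debug.print("destructive {}, constructive {}\n", .{
        hardware_destructive_interference_size,
        hardware_constructive_interference_size,
    });
}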

Anyways. On a 3-year-old (2021) AMD Ryzen 7 PRO 5850U, these are the timings for various paddings:

pad to   8: 3961705236 ns
pad to  16: 2054116545 ns
pad to  32: 1948227890 ns
pad to  64: 178296429 ns
pad to 128: 177247588 ns
pad to 256: 180107536 ns

64 bytes and above are all the same (on multiple runs, any of those can come out faster than the others). Here's the really ugly code I used to test it. If you want to try it on your machine, just save it; there is a comment at the top with a line to copy and paste that will run it for the various padding sizes:

https://gist.github.com/jnordwick/b30b1584fd7c49d68a6bb842abf7d98b
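If you just want the shape of it without opening the gist, it's roughly this (a stripped-down sketch, not the gist code; names are made up): two threads hammer counters spaced by the padding, and the elapsed time is printed per padding.

const std = @import("std");

// Aligning the counter to `pad` forces consecutive array elements to be `pad`
// bytes apart (for pad >= @sizeOf(usize)).
fn Padded(comptime pad: usize) type {
    return struct { value: usize align(pad) = 0 };
}

fn hammer(counter: *usize, iters: usize) void {
    var i: usize = 0;
    while (i < iters) : (i += 1) {
        _ = @atomicRmw(usize, counter, .Add, 1, .monotonic);
    }
}

pub fn main() !void {
    const iters: usize = 10_000_000;
    inline for (.{ 8, 16, 32, 64, 128, 256 }) |pad| {
        var slots = [2]Padded(pad){ .{}, .{} };
        var timer = try std.time.Timer.start();
        const t1 = try std.Thread.spawn(.{}, hammer, .{ &slots[0].value, iters });
        const t2 = try std.Thread.spawn(.{}, hammer, .{ &slots[1].value, iters });
        t1.join();
        t2.join();
        std.debug.print("pad to {d:>3}: {} ns\n", .{ pad, timer.read() });
    }
}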

EDIT: Here are the results on a 2018 Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz. Same pattern, so whatever led to those old results doesn't seem to apply anymore. Or my test is fucked:

pad to   8: 5155030822 ns
pad to  16: 5961970989 ns
pad to  32: 3738350274 ns
pad to  64: 466117370 ns
pad to 128: 466903391 ns
pad to 256: 470904977 ns

@judofyr added the help wanted and enhancement labels Aug 15, 2024
@judofyr
Owner

judofyr commented Aug 16, 2024

Oh, very interesting. I don't know too much about mutexes + cache lines, so at the moment I don't think I'll look into this. Maybe later, when I want to learn more, I'll use this as an example to run some benchmarks.

Keeping this issue open in case others want to improve it.

@QuarticCat

I know they go back a while, and I've never seen compelling evidence. Just some random assertions on comments. The L2 spatial prefetcher will try to complete pairs of lines, but it won't evict L1 data that is modified.

Try this:

#include <atomic>
#include <thread>

// alignas(128) puts counter[0] at the start of an aligned pair of 64-byte
// lines; counter[16] (ints, so 64 bytes in) is the adjacent line of that pair.
alignas(128) std::atomic<int> counter[1024]{};

void update(int idx) {
    for (int j = 0; j < 100000000; ++j) ++counter[idx];
}

int main() {
    // t1/t2 truly share counter[0]; t3/t4 truly share counter[16], one cache
    // line away from the first pair.
    std::thread t1(update, 0);
    std::thread t2(update, 0);
    std::thread t3(update, 16);
    std::thread t4(update, 16);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}

which comes from a question I asked on StackOverflow.
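(If you want a comparison point: build it with optimizations and threads enabled, time a run as written, then change both 16s to 32 so the second pair of threads works 128 bytes away from counter[0], in a different aligned pair of lines, and time it again. At least that's the comparison I'd draw from it; the numbers will obviously vary by CPU.)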

@jnordwick
Author

jnordwick commented Sep 25, 2024 via email
