Map feature comparison #2
Also, I should report this: the read and write guards that CHashMap uses are UB to send across threads, yet they are marked Send + Sync. Well, I'll file a RustSec advisory instead, since CHashMap is unmaintained. |
Ah, you mean if we start listing concurrent maps in the README on this repo? Yes, I agree that that should probably be noted! |
Exactly. |
I'm going to start putting together some benchmarks for different concurrent map implementations here: https://git.nebulanet.cc/Acrimon/conc-map-bench, and list the differences. |
Alright. I've written map adapters for a bunch of concurrent hashmaps and for locked maps from std, available at the git repository above. I have only run this on my i7-7700HQ laptop, and it doesn't look good for flurry. At 8 CPUs it's about equal to a RwLock'ed std HashMap. Something feels off with it. |
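For reference, an adapter for a RwLock'ed std HashMap in such a benchmark harness looks roughly like the sketch below. It is written against bustle's `Collection`/`CollectionHandle` traits from memory, so treat the exact trait bounds and method signatures as approximate, and the `RwLockStdTable` name is just illustrative.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, RwLock};

use bustle::{Collection, CollectionHandle};

// A bustle adapter around std's HashMap behind a single RwLock.
#[derive(Clone)]
struct RwLockStdTable<K>(Arc<RwLock<HashMap<K, ()>>>);

impl<K> Collection for RwLockStdTable<K>
where
    K: Send + Sync + From<u64> + Copy + Hash + Eq + 'static,
{
    type Handle = Self;

    fn with_capacity(capacity: usize) -> Self {
        Self(Arc::new(RwLock::new(HashMap::with_capacity(capacity))))
    }

    // Each benchmark thread gets its own handle; for a lock-based map
    // that is just another clone of the Arc.
    fn pin(&self) -> Self::Handle {
        self.clone()
    }
}

impl<K> CollectionHandle for RwLockStdTable<K>
where
    K: Send + Sync + From<u64> + Copy + Hash + Eq + 'static,
{
    type Key = K;

    fn get(&mut self, key: &Self::Key) -> bool {
        self.0.read().unwrap().contains_key(key)
    }

    fn insert(&mut self, key: &Self::Key) -> bool {
        self.0.write().unwrap().insert(*key, ()).is_none()
    }

    fn remove(&mut self, key: &Self::Key) -> bool {
        self.0.write().unwrap().remove(key).is_some()
    }

    fn update(&mut self, key: &Self::Key) -> bool {
        self.0
            .write()
            .unwrap()
            .get_mut(key)
            .map(|v| *v = ())
            .is_some()
    }
}
```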
That's really interesting — do you have plots anywhere (throughput on y, #cpus on x)? Are you using a |
I haven't made plots yet since I don't have an automated system for it. Going to make some plots soon. You can trivially replicate this though by cloning the repository and running |
I didn't recently run our benchmarks since I'm currently in the middle of the tree bins, but if I recall correctly, the last time I checked (after the

Now, we probably should have reasonable performance when guards are not re-used, which already seemed to be not so much the case from our own benchmarks. Running the mentioned bustle benchmarks locally, flurry performed the worst of all maps by far, up to an order of magnitude worse. I'm surprised you saw comparable performance to

Out of interest, I added an adapter loosely based on our bustle branch which holds one guard for each entire run. With this, local performance becomes very much competitive with the other maps. While we can certainly debate what the default behaviour of our users will be (or, for that matter, what API we should mainly expose), I conjecture that guards are indeed an issue. I do not know how or why exactly this differs from Contrie. |
Hmm, alright. Contrie seems to pin on every operation and is very competitive. I modeled these benchmarks on how the average user is likely to use the map, which is what matters and what should be optimized for. So yeah, Flurry probably needs a good amount of love to improve the speed when guards aren't reused. |
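To make the difference concrete, here is a rough sketch of the two access patterns being compared. It assumes flurry's guard-based API (`guard()` plus `get(key, &guard)`); the exact method names and signatures may differ between versions.

```rust
use flurry::HashMap;

// Pin once per operation: every call creates and drops a fresh epoch guard,
// which is roughly what a casual user (or a generic benchmark adapter) does.
fn per_operation_guards(map: &HashMap<u64, u64>, keys: &[u64]) {
    for k in keys {
        let guard = map.guard();
        map.get(k, &guard);
        // guard dropped here: the thread un-pins and may flush garbage
    }
}

// Re-use a single guard for a whole batch of operations, which is what the
// adapter that "holds one guard for each entire run" does.
fn reused_guard(map: &HashMap<u64, u64>, keys: &[u64]) {
    let guard = map.guard();
    for k in keys {
        map.get(k, &guard);
    }
    // a single pin/un-pin for the whole loop
}
```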
My guess is that we (for some reason) generate significantly more garbage than contrie. That shouldn't be the case, and I don't think there's a fundamental reason why we would, but it would explain that difference. Guards essentially become more expensive proportionally to how frequently you generate garbage. @xacrimon I just released |
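As background for why the amount of garbage matters: with crossbeam-epoch, every retired allocation is deferred through a guard, so a map that allocates and retires more per operation puts more pressure on every pin/unpin. A minimal sketch of the mechanism, not flurry's actual code (`replace` and `slot` are just illustrative names):

```rust
use std::sync::atomic::Ordering;

use crossbeam_epoch::{self as epoch, Atomic, Owned, Shared};

// Swap a new heap-allocated value into a slot and retire the old one.
// The old allocation becomes garbage tied to the current epoch and is only
// freed once no pinned guard can still observe it.
fn replace(slot: &Atomic<u64>, value: u64, guard: &epoch::Guard) {
    let new = Owned::new(value);
    let old: Shared<'_, u64> = slot.swap(new, Ordering::AcqRel, guard);
    if !old.is_null() {
        // Deferred destruction: the more often this happens, the more work
        // piles up on the guards that later get to flush it.
        unsafe { guard.defer_destroy(old) };
    }
}
```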
That seems like a weird EBR implementation if that happens. Going to update the benchmark and run it on my desktop today, then. I am not super familiar with crossbeam-epoch though. |
Worth noting that bustle initializes the maps with a reasonably high initial capacity by default if I understand correctly, so there are likely not that many moves occurring that would create `Moved` to be optimized. |
My benchmark initializes with 35M keys and only about 24M entries will be inside at the peak. |
Updating flurry to 0.2 helps a bit but it is nowhere near as fast as others. |
When you say "as fast", what are you measuring? I am more interested in the scaling as the number of threads goes up than in the performance at lower core counts. The latter is just sort of "constant" overhead that can be fixed, whereas the former points to a contention problem that may indicate an algorithmic issue with the data structure. |
Contrie and the DashMap variants scale better per thread. They see close to ideal scaling, while flurry seems to have a flatter scaling curve, at least on my machine. |
Interesting.. I mean, it could be that Java's |
I agree. Just to have some example numbers, here's the insert-heavy workload after the version bump locally:
And flurry with only one guard:
|
Yeah, that certainly looks a lot like we're generating some unnecessary garbage that ends up hampering scalability. It also suggests that crossbeam-epoch does not handle large amounts of garbage produced across many cores very well, giving us even more of an incentive to avoid generating it in the first place. I'm curious whether the flurry model fundamentally requires more allocations (and thus garbage) than contrie, or whether it is just a matter of the code not being sufficiently carefully written. |
Contrie looks like it's doing about one allocation per insert in the average case. It uses no caching object pool either. I haven't studied the flurry code closely, but it is definitely allocating more. On another note, I discovered an issue in bustle. It instantly panics on prefill due to an erroneous assert. I also need to add prefill to the read and update workloads in my suite, since that changes performance significantly due to probing schemes and a deeper tree in contrie. |
Interesting, what assert is erroneous? The code is pretty closely modeled after the libcuckoo benchmark, including the asserts. I'm not sure I follow why you would need to prefill differently in the read/update workloads. The prefill should always be happening regardless of the underlying workload. |
The assert at line 309, I believe. The default prefill is 0. If I raise that to 0.5 I get dramatically different performance. After some metrics gathering, dashmap is doing a lot more probing with explicit prefill than without. |
That assert should definitely be correct as long as the assumption holds that the keys are distinct, as the documentation specifies. The whole benchmark relies on that assumption. The default prefill should be 75% of the initial capacity, not 0..?
pub fn new(threads: usize, mix: Mix) -> Self {
Self {
mix,
initial_cap_log2: 25,
prefill_f: 0.0,
ops_f: 0.75,
threads,
seed: None,
}
} |
The benchmark runs fine if prefill is off. |
Oh, you're right, I was thinking of |
Fix published in 0.4. |
Tests don't lie :P. Thanks for fixing it on such short notice though, much appreciated. I will update the benchmarks. Prefill more accurately models real-world scenarios, so it only makes sense to use it. |
:p It's not clear to me that "real world" scenarios will always be prefilled. I think it makes sense to test both. A common use-case for example is to have a bunch of threads collectively fill up some shared map with data as they process, in which case there really is no prefill phase. That's when things like cooperative resizing (which |
Ah, I'll do a read-only benchmark too, and I will also collect some profiling data and send it so we can figure out what is eating the time. |
I think you were right. Here is a flamegraph of the entire benchmark. The mix was the same as before. The flurry part is the mix -> FlurryTable slab in the graph. All dependencies are updated. |
Yeah, that certainly doesn't look great. It seems like almost all the time is spent on garbage collection, which is surprising to me since reads shouldn't generate garbage. This is why I think a read-only mix might be instructive, to see if |
Something is up. I put together a read-only benchmark for flurry and it seems to generate gigabytes of garbage in seconds on my machine. So you are definitely generating a lot of garbage. Tracking epochs is pretty fast; contrie also uses |
That's crazy, and should definitely not be happening. In a read-only benchmark, we shouldn't be generating any garbage. I'm currently kind of swamped, but maybe @domenicquirl can take a look? |
I'm in a similar situation at the moment, I fear. I've had some forced free time due to Covid-19, but now everything is heading back online, except more chaotic. I've been following the discussion through mail; this does look wrong. If this is an issue with us and not |
Finally found some time to look at this. I was interested most in the "gigabytes of garbage" for a start, so I ran valgrind's DHAT on a small test with two concurrent threads reading from a pre-initialized map. To check, I first had the test use only one guard per thread, and I got only the initial creation of the map as significant allocations (plus some allocations from the test binary), split into the fast and the slow path in

Then I switched to calling
Note that absolute numbers are in relation to 1000 map elements, so 2000 calls to |
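For anyone wanting to reproduce this locally, the test described above looks roughly like the following. This is only a sketch: it assumes flurry's `guard()`/`get`/`insert` API and uses the 1000-element, two-reader setup mentioned, not the exact test that was run.

```rust
use std::sync::Arc;
use std::thread;

use flurry::HashMap;

fn main() {
    let map = Arc::new(HashMap::new());

    // Pre-initialize the map with 1000 elements using a single guard.
    {
        let guard = map.guard();
        for i in 0..1000u64 {
            map.insert(i, i, &guard);
        }
    }

    // Two concurrent readers, each creating a fresh guard per `get`,
    // for 2000 read calls in total.
    let readers: Vec<_> = (0..2)
        .map(|_| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..1000u64 {
                    let guard = map.guard();
                    assert!(map.get(&i, &guard).is_some());
                }
            })
        })
        .collect();

    for r in readers {
        r.join().unwrap();
    }
}
```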
I might actually know why this happens. Guards may, on drop, trigger a flush of the local garbage list, and it looks like it doesn't check whether the list is empty and swaps it regardless. |
Sometimes it's also moved onto the global queue, which allocates a new thread-local bag of garbage. |
Wow, that would be really bad. This smells like a bug if it allocates even if there is no garbage... |
I'm also a bit confused that we spend so much time/memory allocating epoch |
Pretty sure you don't need to grab new handles. |
Hmm, that doesn't seem right. |
Why are we using a custom collector anyway? I don't see a benefit compared to the global static one. |
It is in TLS, yeah, but that's where the |
The biggest reason is memory reclamation. By using a collector, you can guarantee that all memory is dropped when the data structure is dropped. This also, in theory, can allow the removal of the
Are you thinking that if the guard doesn't match, then we update the thread local? That could work... Cleanup will get complicated though, because you need to also clear that thread local, or we'll potentially hold on to remaining garbage forever. It could also lead to really weird performance profiles, where if you do lookups into two maps, one after the other, then you'd get the slow path every time.
I'm not sure I fully follow what you're saying here. I agree that ideally performance should be good "by default", without the user having to do anything weird. I don't quite know how we do this without either moving back to just using the global collector, or doing the aforementioned "optimistic caching" of the |
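For readers following along, the per-map collector setup being discussed looks like this in crossbeam-epoch terms. This is a bare-bones illustration of the `Collector`/`LocalHandle` API, not flurry's actual code:

```rust
use crossbeam_epoch::Collector;

fn main() {
    // Each map can own its own collector instead of using the global one.
    // Garbage deferred through guards from this collector belongs to this
    // map alone, so it can all be reclaimed when the map is dropped.
    let collector = Collector::new();

    // A thread registers a local handle with the collector...
    let handle = collector.register();

    // ...and pins through that handle to get a guard for its operations.
    let guard = handle.pin();
    drop(guard);
}
```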
I was only thinking so far as that the check would cover the case where you try to access with a guard from the wrong local. I don't think I like the idea of caching only for one map; a pattern like

map1.get(key1, guard1);
map2.get(key2, guard2);
map2.get(key3, guard3);

is common enough that this would be a bad idea for the reasons you outlined. Is there a good way to store multiple handles in TLS dynamically, so we could store locals for each map separately for each thread? Independently, a third option would be to make

Ideally, crossbeam would probably store the thread for which a local gets registered together with the local itself, and then check if it needs to create a new one or if an existing one can be reused? |
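One way the "store locals for each map separately for each thread" idea could be sketched is below. This is purely illustrative: `collector_id` is a hypothetical stable per-map identifier, and the cleanup concern raised earlier still applies, since cached handles would outlive a dropped map unless they are also cleared.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

use crossbeam_epoch::{Collector, Guard, LocalHandle};

thread_local! {
    // One cached LocalHandle per collector (i.e. per map), per thread.
    static HANDLES: RefCell<HashMap<usize, LocalHandle>> = RefCell::new(HashMap::new());
}

// Pin through a cached handle for `collector`, registering one on first use.
// `collector_id` is a hypothetical identifier for the owning map.
fn pin_cached(collector: &Collector, collector_id: usize) -> Guard {
    HANDLES.with(|handles| {
        handles
            .borrow_mut()
            .entry(collector_id)
            .or_insert_with(|| collector.register())
            .pin()
    })
}
```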
Yup, the checks would indeed catch this, but the user would probably be quite confused if their use sometimes ended up with an error due to caching.
Completely agree.
We'd have to store a
This would get rid of the check, true, but it would not solve the problem — users would still then either have to manually keep the
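For context, the kind of check being referred to can be expressed directly against crossbeam-epoch, since a guard knows which collector it was pinned from. This is a sketch only, not the exact code in flurry:

```rust
use crossbeam_epoch::{Collector, Guard};

// Panic if `guard` was not created from `map_collector`, i.e. it belongs to
// a different map's collector (or to the global one).
fn check_guard(map_collector: &Collector, guard: &Guard) {
    match guard.collector() {
        Some(c) if c == map_collector => {}
        _ => panic!("guard was not created from this map's collector"),
    }
}
```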
Hmm, this would just make

I wonder what the picture looks like if a

If it turns out that it really is creating a |
Why would we want to have the entire map be local to each thread? Then where would the common data be?
I'm not thinking about moving handles or safety checks, but about registering local handles and obtaining guards. If we had a list of locals with their associated threads, then whenever a thread wants to obtain a new guard we could check that list to see if we have already registered a local for this thread. If so, we use that local to pin and get a guard; only otherwise do we actually register a new local with the collector.

I agree it is likely that that has more performance impact. However, given that having to do this "simple allocation" for every read accounted for almost half of the program's memory usage, it's still something I would like to avoid. Especially if having a new handle for each |
Ah, sorry, that wasn't what I meant. What I meant was that you'd store a |
Oh, ok. Then it's just the other way round from me thinking about having something like a |
Oh, no, that won't work, since |
Not directly. There is a list of locals in the |
Oh, I see, you're proposing changing the implementation of |
I've updated deps for the benchmark now, and I have changed flurry to use HashMapRef handles. Since I've released a v4 dashmap release candidate, I believe the implementations should be pretty stable. So whenever you can, you've got the green light from me to run them. |
Did either of you actually get back to |
Also, @xacrimon, I tried running the updated benchmark, but some dependency has a |
@domenicquirl Interesting, |
Worth noting is that some implementations provide different features and guarantees, such as being able to share reference guards across threads, which is useful for async. I think all maps included in the benchmark should be compared on features and guarantees. This can be a short table and doesn't have to be a sizable writeup, just something to help a user choose an implementation that provides the features and guarantees they need.
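As a starting point, such a table might look something like the skeleton below. The rows and entries are only illustrative, based on points raised in this thread, and would need to be verified against each crate's documentation before publishing.

| Map | Concurrency approach | Guards / references | Notes |
| --- | --- | --- | --- |
| flurry | epoch-based reclamation via crossbeam-epoch | explicit guard API; reusing a guard matters for performance | |
| contrie | crossbeam-epoch, pins on every operation | no user-held guards | deeper tree with prefill |
| DashMap | sharded locking | references tied to shard locks | |
| CHashMap | per-bucket locking | read/write guards marked Send + Sync but unsound to send across threads | unmaintained |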