Possible Optimizations #3
Comments
Of course I realized the problem right after posting the issue. I forgot that the test has to allocate all keys beforehand, which means 100M * 8 bytes, so about 800 MB allocated for that, which is of course included in MemStats.Alloc. Then the reallocations for the redo slice add another 40M * 8 bytes, so a total of over 1 GB makes sense. The memory profile for append(redo) is misleading, since it is the sum of several allocations that are only temporary. But I still think that over 320 MB of RAM for a 40M-entry redo slice feels like a bit much. Do you think it would be possible to get closer to the ~75 MB that the paper's results suggest?
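To make that accounting concrete, here is a rough sketch of the kind of measurement involved (simplified and hypothetical, not the actual test or library code); it only shows how the input keys and a redo-sized slice both end up in MemStats.Alloc:

```go
package main

import (
	"fmt"
	"runtime"
)

// allocMiB returns the current live heap (MemStats.Alloc) in MiB,
// after forcing a GC so that freed temporaries are not counted.
func allocMiB() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.Alloc / (1 << 20)
}

func main() {
	base := allocMiB()

	// The test input itself: 100M * 8 bytes ≈ 800 MB, allocated before
	// construction even starts, yet included in MemStats.Alloc.
	keys := make([]uint64, 100_000_000)
	fmt.Printf("after allocating keys: +%d MiB\n", allocMiB()-base)

	// Stand-in for the redo slice built during construction:
	// roughly 40M * 8 bytes ≈ 320 MB on top of the input keys.
	redo := make([]uint64, 40_000_000)
	fmt.Printf("after the redo-sized slice: +%d MiB\n", allocMiB()-base)

	// Keep both slices live so the GC inside allocMiB cannot reclaim them.
	_, _ = keys, redo
}
```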
Having dug deeper into the algorithm and the paper, I now understand that you implemented method 3.4/1 from the paper, but instead of storing keys on disk you store them in the redo slice in memory.

I tried to see how this temporary memory usage could be reduced and realized that one can simply create a bitvector of size len(keys) and use it to remember which keys need to move to the next level. This cuts the memory overhead to len(keys)/8 bytes, which is a lot less. The downside is that we now need to iterate over all keys (or rather, over all bitvector bits) at each level. I made a simple implementation and tried to make the iteration as fast as possible, e.g. by skipping runs of zero bits with bits.TrailingZeros64() (a minimal sketch of that loop is included below), but construction time increased by roughly 25% compared to the redo slice. After re-reading the paper I think this is actually what it means by the /2 variant. So in some cases one might want to sacrifice a bit more CPU to avoid the temporary memory overhead. I realize that in this specific implementation all keys have to reside in memory anyway, so something like 50% more memory might not be a showstopper. But one can envision a similar implementation that does not require all keys to be available at once, instead taking an iterator and the number of elements as parameters.

I came up with two optimizations for the redo overhead that don't require any API changes; both are described below.
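For illustration, here is a minimal, self-contained sketch of the bit-skipping loop mentioned above (not the actual implementation; the helper name and layout are made up):

```go
package main

import (
	"fmt"
	"math/bits"
)

// forEachSetBit calls fn with the index of every set bit in bv.
// Whole zero words are skipped, and bits.TrailingZeros64 jumps
// directly to the next set bit inside a word.
func forEachSetBit(bv []uint64, fn func(i int)) {
	for w, word := range bv {
		for word != 0 {
			bit := bits.TrailingZeros64(word)
			fn(w*64 + bit)
			word &= word - 1 // clear the lowest set bit
		}
	}
}

func main() {
	// Mark keys 3, 64 and 130 as "needs to move to the next level".
	bv := make([]uint64, 3)
	for _, i := range []int{3, 64, 130} {
		bv[i/64] |= 1 << (i % 64)
	}
	forEachSetBit(bv, func(i int) { fmt.Println("redo key index:", i) })
}
```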
To summarize the first optimization (the bitvector approach): at the cost of about 25% more CPU during construction, the temporary memory overhead can be cut from roughly 50% (about 43 bits per key, depending on gamma) down to one bit per key. Even without any further CPU overhead it should be possible to cut the memory overhead by about half.

The second optimization: the double loop over all keys made me wonder whether everything could instead be done in a single iteration. The second iteration is needed because, when a collision is detected, one only knows the current key but not the key it collided with. Therefore, by spending more memory and remembering which key set a given bit in the bitvector, both colliding keys can be bumped to the redo list immediately, avoiding the second iteration over all keys at a given level. Unfortunately the memory overhead is large: it is as big as the keys array itself. Combining this with the previous optimization and storing only 32-bit offsets into the keys array brings the overhead down to 50% of the input data. A rough sketch of this single-pass idea is included at the end of this comment.

I created a test implementation of this technique, but unfortunately I saw no speed gains and in fact a 15% slowdown. This was unexpected and I am not sure exactly why, even after profiling, but I suspect the culprit is the much increased number of cache misses from accessing that large temporary array in random order. 15.6s out of 20.9s, or nearly 75% of the time, was spent either accessing the collision-keys array or the original keys array. I guess that makes sense, as profiling your original implementation shows that slightly more than half of the CPU time is also spent in get() or set() on the bitvectors.
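Here is the promised sketch of the single-pass idea (hypothetical code, not the actual implementation; the function name, the hash callback, and the plain bool slices standing in for real bitvectors are all made up for illustration):

```go
// noOwner marks a slot that no key has claimed yet.
const noOwner = ^uint32(0)

// assignLevel sketches one level of construction in a single pass over keys.
// slotOwner remembers, as a 32-bit offset into keys, which key first claimed
// each slot, so that on a collision both keys can be pushed to redo
// immediately instead of requiring a second pass over all keys.
func assignLevel(keys []uint64, m uint32, hash func(uint64) uint32) (occupied []bool, redo []uint32) {
	slotOwner := make([]uint32, m) // ~4 bytes of temporary overhead per slot
	for i := range slotOwner {
		slotOwner[i] = noOwner
	}
	collided := make([]bool, m)

	for i, k := range keys {
		slot := hash(k) % m
		switch {
		case collided[slot]:
			// Slot already known to collide: this key must be redone.
			redo = append(redo, uint32(i))
		case slotOwner[slot] == noOwner:
			// First key to claim this slot.
			slotOwner[slot] = uint32(i)
		default:
			// Second key in this slot: bump both keys to redo right away.
			redo = append(redo, slotOwner[slot], uint32(i))
			slotOwner[slot] = noOwner
			collided[slot] = true
		}
	}

	// Slots that kept a single owner are the ones resolved at this level.
	occupied = make([]bool, m)
	for slot, owner := range slotOwner {
		if owner != noOwner {
			occupied[slot] = true
		}
	}
	return occupied, redo
}
```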
Thanks for your detailed write-up.
I noticed that New() has a pretty large memory overhead during construction.
Creating the hash function for 100M entries allocates about 1.6 GB of RAM while doing its work, for a data structure that ends up at only 46 MB. I first noticed this when I wanted to build 1B entries, which should result in a data structure of roughly 0.5 GB, but my laptop with 8 GB of RAM started swapping.
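(For scale, taking those numbers at face value: 46 MB over 100M keys is roughly 3.7 bits per key for the finished structure, while 1.6 GB over 100M keys is about 16 bytes per key allocated during construction, i.e. twice the 8 bytes per key of the input itself.)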
Running a quick profile showed:
Inspecting the code with some quick and dirty prints showed that:
I assume you implemented the BBhash γ=2 nodisk variant of the algorithm from the paper (well, gamma is dynamic here, but never mind). In the paper they report a 743 MB RAM allocation for 1B entries, and from my testing the allocated space should be roughly linear in the number of entries, so 100M should be in the range of 75 MB.

Unfortunately I have no immediate ideas for how to improve the memory usage of this implementation. I am not sure whether the redo slice's length of 40M entries is reasonable or not.
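(For reference, the linear-scaling estimate behind the 75 MB figure: 743 MB over 1B keys is about 0.74 bytes per key, so 100M keys would come to roughly 74 MB.)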
Any thoughts?