
Possible Optimizations #3

Open

arctica opened this issue Apr 17, 2020 · 3 comments


arctica commented Apr 17, 2020

I noticed that New() has a pretty large memory overhead during construction.

Creating the hash function for 100M entries allocates about 1.6 GB of RAM while building a 46 MB data structure. I first noticed this when I tried 1B entries, which should result in a data structure of roughly 0.5 GB, but my laptop with 8 GB of RAM started swapping.

Running a quick profile showed:

    1.50GB     1.50GB     68:				redo = append(redo, v)

Inspecting the code with some quick and dirty prints showed that:

  1. The redo slice grows up to more than 40M entries
  2. The underlying array gets re-allocated several times (inspected hdr.Data pointer via reflect)

I assume you implemented the BBhash γ=2 no-disk variant of the algorithm from the paper (well, gamma is dynamic here, but never mind). The paper reports a 743 MB RAM allocation for 1B entries, and from my testing the allocated space should be roughly linear in the number of entries, so 100M entries should be in the range of 75 MB.

Unfortunately I have no immediate ideas for how to improve the memory usage of this implementation. I am also not sure whether a redo slice length of 40M entries is reasonable or not.

Any thoughts?


arctica commented Apr 17, 2020

Of course, I realized the problem right after posting the issue.

I forgot that the test has to allocate all keys beforehand, which means 100M * 8 bytes, so about 800 MB is allocated just for that and is of course included in MemStats.Alloc. Then the re-allocation of the redo slice adds 40M * 8 bytes, so a total of over 1 GB makes sense. The memory profile for append(redo, v) is misleading, since it is the sum of several allocations that are only temporary.
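Just to spell out that arithmetic in one place (a throwaway snippet; the numbers are the ones measured above, nothing more exact):

```go
package main

import "fmt"

func main() {
	const keyBytes = 8 // uint64 keys

	keys := 100_000_000 // keys the benchmark allocates up front
	redo := 40_000_000  // observed peak length of the redo slice

	fmt.Printf("keys: ~%d MB\n", keys*keyBytes/1_000_000) // ~800 MB
	fmt.Printf("redo: ~%d MB\n", redo*keyBytes/1_000_000) // ~320 MB
}
```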

But roughly 320 MB of RAM (40M entries) for the redo slice still feels like a bit much. Do you think it would be possible to get closer to the 75 MB that the paper's results suggest?


arctica commented Apr 20, 2020

Having dug deeper into the algorithm and the paper, I now understand that you implemented method 3.4/1 from the paper, except that instead of storing keys on disk you keep them in the in-memory redo slice.

I tried to see how one could reduce this temporary memory usage and realized that one can just create a bitvector of size len(keys) and use that to remember which keys need to move to the next level. This reduces the memory overhead to len(keys)/8 bytes, which is a lot less. The downside is that we now need to iterate over all keys (or rather over the bitvector bits) at each level. I made a simple implementation and tried to make the iteration as fast as possible, e.g. by skipping runs of zero bits using bits.TrailingZeros64(), but construction time increased by roughly 25% compared to the redo slice. Having now re-read the paper, I think this is actually what it means by the /2 variant.

So in some cases one might want to sacrifice a bit more CPU to avoid the temporary memory overhead. I realize that in this specific implementation all keys have to reside in memory anyway, so something like 50% more memory might not be a showstopper. But one can envision a similar implementation that does not require all keys to be available at once, instead taking an iterator and the number of elements as parameters.
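For concreteness, here is a minimal sketch of that bitvector bookkeeping; the names (redoBits, mark, forEach) are mine, not from this repository:

```go
package sketch

import "math/bits"

// redoBits is one bit per key in the original keys slice: a set bit
// means that key still has to be placed at the next level.
type redoBits []uint64

func newRedoBits(n int) redoBits { return make(redoBits, (n+63)/64) }

func (b redoBits) mark(i int) { b[i/64] |= 1 << (uint(i) % 64) }

// forEach calls fn with the index of every marked key, skipping zero
// words and using bits.TrailingZeros64 to jump to the next set bit.
func (b redoBits) forEach(fn func(i int)) {
	for w, word := range b {
		for word != 0 {
			bit := bits.TrailingZeros64(word) // position of lowest set bit
			fn(w*64 + bit)
			word &= word - 1 // clear that bit and continue
		}
	}
}
```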

I came up with two optimizations for the redo overhead that don't require any API changes:

  1. Instead of storing the 64-bit keys themselves in the redo slice, just store indexes into the original keys array. This way one could use e.g. []uint32 if we know there are fewer than 2^32 keys, and I suspect most of the time people will store fewer than 4 billion keys. The type can be decided at runtime depending on len(keys), especially considering that the currently used bitvectors only support uint32 anyway. One could even use an array of bits.Len(uint(len(keys)))-bit integers, i.e. just enough bits to index every key; that would be optimal in terms of space, completely dynamic, and only add minor CPU overhead.

  2. To avoid the many re-allocations of redo's underlying array, one can pre-allocate the slice with a cap that will most likely be sufficient. Taking the math from the paper, we expect N*e^(-1/gamma) keys without a collision at level 0, so setting the initial cap to N - N*e^(-1/gamma), or maybe 1% more than that, should avoid nearly all re-allocations. In my tests this eliminated them completely. A sketch combining both ideas follows below.
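A rough sketch combining both ideas; the placed() callback is a stand-in for the actual level-0 placement logic and is purely illustrative:

```go
package sketch

import "math"

// buildLevel0Redo sketches both ideas above: the redo slice is pre-sized
// from the expected number of level-0 collisions, N - N*e^(-1/gamma)
// (plus ~1% slack), and it stores uint32 indexes into keys instead of
// the 64-bit keys themselves. placed(i) stands in for "key i landed in
// the level-0 bitvector without a collision".
func buildLevel0Redo(keys []uint64, gamma float64, placed func(i int) bool) []uint32 {
	n := float64(len(keys))
	expected := n - n*math.Exp(-1/gamma)
	redo := make([]uint32, 0, int(expected*1.01))

	for i := range keys {
		if !placed(i) {
			redo = append(redo, uint32(i)) // 4 bytes per entry, not 8
		}
	}
	return redo
}
```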

Summary: at the cost of about 25% more CPU during construction, one can cut the temporary memory overhead from roughly 50% of the input size (~43 bits per key, depending on gamma) down to one bit per key. Even without any extra CPU overhead, it should be possible to cut the memory overhead roughly in half.

The double loop over all keys made me wonder whether it is possible to do everything in one iteration. The second iteration is necessary because when one detects a collision, one only knows the current key, not the key it collided with. Therefore, by introducing more memory overhead and remembering which key set a given bit in the bitvector, one can immediately bump both colliding keys to the redo list. This avoids the second iteration over all keys in a given level. Unfortunately the memory overhead is large: it is as big as the keys array itself. Applying the optimization from before and storing only 32-bit offsets into the keys array brings the overhead down to 50% of the input data.

I created a test implementation of this technique, but unfortunately I saw no speed gains and in fact measured a 15% slowdown. This was unexpected and I am not sure exactly why, even after profiling, but I suspect the culprit is the much higher cache-miss rate caused by accessing that large temporary array in random order. 15.6s out of 20.9s, or nearly 75% of the time, were spent either accessing the collision-keys array or the original keys array. I guess that makes sense, since profiling your original implementation shows that slightly more than half of the CPU time is also spent in get() or set() on the bitvectors.

I don't think this overhead can be reduced: due to the nature of hashing, memory access has to be random. Reducing the input size so that the bitvectors fit into L3 cache gives significant speedups (nearly 2x) on my laptop.
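Here is a rough sketch of that one-pass variant; buildLevelOnePass, slotOwner and the sentinel values are all made up for illustration, and a plain []bool stands in for the real occupancy bitvector:

```go
package sketch

// One-pass level construction: slotOwner remembers which key index first
// claimed each slot, so both keys of a collision can be bumped to redo
// immediately, removing the second pass over the keys. The price is one
// uint32 per slot, i.e. roughly 50% of the input size for 64-bit keys.
// (The sentinels assume fewer than 2^32-2 keys.)

const (
	slotEmpty    = ^uint32(0)     // no key has claimed this slot yet
	slotCollided = ^uint32(0) - 1 // the slot already saw a collision
)

func buildLevelOnePass(keys []uint64, slots int, hash func(k uint64) uint64) (occupied []bool, redo []uint32) {
	slotOwner := make([]uint32, slots)
	for i := range slotOwner {
		slotOwner[i] = slotEmpty
	}
	occupied = make([]bool, slots)

	for i, k := range keys {
		s := hash(k) % uint64(slots)
		switch slotOwner[s] {
		case slotEmpty:
			slotOwner[s] = uint32(i)
			occupied[s] = true
		case slotCollided:
			redo = append(redo, uint32(i)) // earlier keys were already bumped
		default:
			// First collision on this slot: bump both the previous owner
			// and the current key, then mark the slot as dead.
			redo = append(redo, slotOwner[s], uint32(i))
			slotOwner[s] = slotCollided
			occupied[s] = false
		}
	}
	return occupied, redo
}
```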

arctica changed the title from "Large Memory Overhead" to "Possible Optimizations" on Apr 20, 2020

dgryski commented Apr 23, 2020

Thanks for your detailed write-up.
