
Possible Optimizations #3

Open

arctica opened this issue Apr 17, 2020 · 3 comments


arctica commented Apr 17, 2020

I noticed that New() has a pretty large memory overhead during construction.

Creating the hash function for 100M entries allocates about 1.6 GB of RAM while building a 46 MB data structure. I first noticed this when I tried 1B entries, which should result in a data structure of roughly 0.5 GB, but my laptop with 8 GB of RAM started swapping.

Running a quick profile showed:

    1.50GB     1.50GB     68:				redo = append(redo, v)

Inspecting the code with some quick and dirty prints showed that:

  1. The redo slice grows up to more than 40M entries
  2. The underlying array gets re-allocated several times (inspected hdr.Data pointer via reflect)

I assume you implemented the BBhash γ=2 no-disk variant of the algorithm from the paper (well, gamma is dynamic here, but never mind). The paper reports a 743 MB RAM allocation for 1B entries, and from my testing the allocated space should be roughly linear in the number of entries, so 100M entries should be in the range of 75 MB.

Unfortunately I have no immediate ideas for how to improve the memory usage of this implementation. I am also not sure whether a redo slice length of 40M entries is reasonable or not.

Any thoughts?


arctica commented Apr 17, 2020

Of course, I realized the problem right after posting the issue.

I forgot that the test has to allocate all keys beforehand, which means 100M * 8 bytes, so about 800 MB is allocated just for that and is of course included in MemStats.Alloc. Then the re-allocation of the redo slice adds 40M * 8 bytes, so a total of over 1 GB makes sense. The memory profile for append(redo, v) is misleading, since it is the sum of several allocations that are only temporary.
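Just to spell out that arithmetic in one place (a throwaway snippet; the numbers are the ones measured above, nothing more exact):

```go
package main

import "fmt"

func main() {
	const keyBytes = 8 // uint64 keys

	keys := 100_000_000 // keys the benchmark allocates up front
	redo := 40_000_000  // observed peak length of the redo slice

	fmt.Printf("keys: ~%d MB\n", keys*keyBytes/1_000_000) // ~800 MB
	fmt.Printf("redo: ~%d MB\n", redo*keyBytes/1_000_000) // ~320 MB
}
```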

But roughly 320 MB of RAM (40M entries) for the redo slice still feels like a bit much. Do you think it would be possible to get closer to the 75 MB that the paper's results suggest?


arctica commented Apr 20, 2020

Having dug deeper into the algorithm and the paper, I now understand that you implemented method 3.4/1 from the paper, except that instead of storing keys on disk you keep them in the in-memory redo slice.

I tried to see how one could reduce this temporary memory usage and realized that one can just create a bitvector of size len(keys) and use that to remember which keys need to move to the next level. This reduces the memory overhead to len(keys)/8 bytes, which is a lot less. The downside is that we now need to iterate over all keys (or rather over the bitvector bits) at each level. I made a simple implementation and tried to make the iteration as fast as possible, e.g. by skipping runs of zero bits using bits.TrailingZeros64(), but construction time increased by roughly 25% compared to the redo slice. Having now re-read the paper, I think this is actually what it means by the /2 variant.

So in some cases one might want to sacrifice a bit more CPU to avoid the temporary memory overhead. I realize that in this specific implementation all keys have to reside in memory anyway, so something like 50% more memory might not be a showstopper. But one can envision a similar implementation that does not require all keys to be available at once, instead taking an iterator and the number of elements as parameters.
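For concreteness, here is a minimal sketch of that bitvector bookkeeping; the names (redoBits, mark, forEach) are mine, not from this repository:

```go
package sketch

import "math/bits"

// redoBits is one bit per key in the original keys slice: a set bit
// means that key still has to be placed at the next level.
type redoBits []uint64

func newRedoBits(n int) redoBits { return make(redoBits, (n+63)/64) }

func (b redoBits) mark(i int) { b[i/64] |= 1 << (uint(i) % 64) }

// forEach calls fn with the index of every marked key, skipping zero
// words and using bits.TrailingZeros64 to jump to the next set bit.
func (b redoBits) forEach(fn func(i int)) {
	for w, word := range b {
		for word != 0 {
			bit := bits.TrailingZeros64(word) // position of lowest set bit
			fn(w*64 + bit)
			word &= word - 1 // clear that bit and continue
		}
	}
}
```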

I came up with two optimizations for the redo overhead that don't require any API changes:

  1. Instead of storing the 64-bit keys themselves in the redo slice, just store indexes into the original keys array. This way one could use e.g. []uint32 if we know there are fewer than 2^32 keys, and I suspect most of the time people will store fewer than 4 billion keys. The type can be decided at runtime depending on len(keys), especially considering that the currently used bitvectors only support uint32 anyway. One could even use an array of bits.Len(uint(len(keys)))-bit integers, i.e. just enough bits to index every key; that would be optimal in terms of space, completely dynamic, and only add minor CPU overhead.

  2. To avoid the many re-allocations of redo's underlying array, one can pre-allocate the slice with a cap that will most likely be sufficient. Taking the math from the paper, we expect N*e^(-1/gamma) keys without a collision at level 0, so setting the initial cap to N - N*e^(-1/gamma), or maybe 1% more than that, should avoid nearly all re-allocations. In my tests this eliminated them completely. A sketch combining both ideas follows below.
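A rough sketch combining both ideas; the placed() callback is a stand-in for the actual level-0 placement logic and is purely illustrative:

```go
package sketch

import "math"

// buildLevel0Redo sketches both ideas above: the redo slice is pre-sized
// from the expected number of level-0 collisions, N - N*e^(-1/gamma)
// (plus ~1% slack), and it stores uint32 indexes into keys instead of
// the 64-bit keys themselves. placed(i) stands in for "key i landed in
// the level-0 bitvector without a collision".
func buildLevel0Redo(keys []uint64, gamma float64, placed func(i int) bool) []uint32 {
	n := float64(len(keys))
	expected := n - n*math.Exp(-1/gamma)
	redo := make([]uint32, 0, int(expected*1.01))

	for i := range keys {
		if !placed(i) {
			redo = append(redo, uint32(i)) // 4 bytes per entry, not 8
		}
	}
	return redo
}
```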

Summary: at the cost of about 25% more CPU during construction, one can cut the temporary memory overhead from roughly 50% of the input size (~43 bits per key, depending on gamma) down to one bit per key. Even without any extra CPU overhead, it should be possible to cut the memory overhead roughly in half.

The double loop over all keys made me wonder whether it is possible to do everything in one iteration. The second iteration is necessary because when one detects a collision, one only knows the current key, not the key it collided with. Therefore, by introducing more memory overhead and remembering which key set a given bit in the bitvector, one can immediately bump both colliding keys to the redo list. This avoids the second iteration over all keys in a given level. Unfortunately the memory overhead is large: it is as big as the keys array itself. Applying the optimization from before and storing only 32-bit offsets into the keys array brings the overhead down to 50% of the input data.

I created a test implementation of this technique, but unfortunately I saw no speed gains and in fact measured a 15% slowdown. This was unexpected and I am not sure exactly why, even after profiling, but I suspect the culprit is the much higher cache-miss rate caused by accessing that large temporary array in random order. 15.6s out of 20.9s, or nearly 75% of the time, were spent either accessing the collision-keys array or the original keys array. I guess that makes sense, since profiling your original implementation shows that slightly more than half of the CPU time is also spent in get() or set() on the bitvectors.

I don't think this overhead can be reduced: due to the nature of hashing, memory access has to be random. Reducing the input size so that the bitvectors fit into L3 cache gives significant speedups (nearly 2x) on my laptop.
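Here is a rough sketch of that one-pass variant; buildLevelOnePass, slotOwner and the sentinel values are all made up for illustration, and a plain []bool stands in for the real occupancy bitvector:

```go
package sketch

// One-pass level construction: slotOwner remembers which key index first
// claimed each slot, so both keys of a collision can be bumped to redo
// immediately, removing the second pass over the keys. The price is one
// uint32 per slot, i.e. roughly 50% of the input size for 64-bit keys.
// (The sentinels assume fewer than 2^32-2 keys.)

const (
	slotEmpty    = ^uint32(0)     // no key has claimed this slot yet
	slotCollided = ^uint32(0) - 1 // the slot already saw a collision
)

func buildLevelOnePass(keys []uint64, slots int, hash func(k uint64) uint64) (occupied []bool, redo []uint32) {
	slotOwner := make([]uint32, slots)
	for i := range slotOwner {
		slotOwner[i] = slotEmpty
	}
	occupied = make([]bool, slots)

	for i, k := range keys {
		s := hash(k) % uint64(slots)
		switch slotOwner[s] {
		case slotEmpty:
			slotOwner[s] = uint32(i)
			occupied[s] = true
		case slotCollided:
			redo = append(redo, uint32(i)) // earlier keys were already bumped
		default:
			// First collision on this slot: bump both the previous owner
			// and the current key, then mark the slot as dead.
			redo = append(redo, slotOwner[s], uint32(i))
			slotOwner[s] = slotCollided
			occupied[s] = false
		}
	}
	return occupied, redo
}
```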

arctica changed the title from "Large Memory Overhead" to "Possible Optimizations" on Apr 20, 2020

dgryski commented Apr 23, 2020

Thanks for your detailed write-up.
