Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lz4hc compression support #21

Open
roblabla opened this issue Aug 9, 2021 · 1 comment
Open

lz4hc compression support #21

roblabla opened this issue Aug 9, 2021 · 1 comment

Comments

@roblabla
Copy link

roblabla commented Aug 9, 2021

Hello, currently lz4_flex currently only supports the "standard" LZ4 compression, which means it can only compress at level 0, 1 and 2 (in liblz4, those levels are all the same). To support higher compression levels, LZ4 also has a mode called High Compression (HC), but it is unfortunately unimplemented in lz4_flex.

I was wondering if lz4_flex has any plans to implement HC mode in its compressor?

@PSeitz
Copy link
Owner

PSeitz commented Aug 12, 2021

Hi,

generally yes, I would really like to have it in lz4_flex. I did some experiments some time ago, but they are outdated now.

I think conceptually it should be quite simple, instead of having one entry in the hashmap to search for duplicates back in the stream we would have multiple buckets for one hash and they would all get checked to find the best one.
There may be other strategies, but I think this is the main one, for better compression in lz4.

Multiple Positions per Hash in the Hashmap

e.g. imagine this byte sequence

[1, 2, 3, 4, 98, 1, 2, 3, 4, 97, 1, 2, 3, 4, 98]

While walking over the data, we put hash of every 4 bytes and its position into the hashmap.

1, 2, 3, 4 -> (hash 999, position 0)
2, 3, 4, 96 -> (hash 555, position 1)

...
when we arrive a the second entry, currently it would overwrite the old entry
1, 2, 3, 4 -> (hash 999, position 5)

If we would keep multiple entries for the hash, the last sequence 1, 2, 3, 4, 98, could compare both entries and see that the first is a longer match.

Currently the hashmap is simply a Vec, and the hash is right shifted to fit in the Vec bounds. It should be possible to double (or quadruple) the Vec and then have multiple posititons for one hash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants