High-level description of S2's compression process? #733

marklit · 2023-01-07T04:43:07Z

marklit
Jan 7, 2023

Do you have any high-level documentation on how S2 compresses data? I can't find any papers published by Google on the topic and your implementation appears to be the only one available to the public.

I built a flame graph of s2c compressing a file, and it reported a little more than 90% of the time being spent in encodeBetterBlockAsm4MB.

encodeblock_amd64.s has around 18K lines of Assembler. I believe this file is hand-written rather than generated based on the commits being made to it but there aren't any annotations. There are 55 jump points with _encodeBetterBlockAsm4MB in their name but these won't be visible to perf when running.

S2 is said to detect and skip over data that isn't compressible and the way this is described makes me believe Snappy never had this sort of capability but when I examined Google's Snappy C code I can see there is heuristic match skipping support in their encoder. I suspect this is further enhanced in S2 to the point that it's worth mentioning as its own feature. It would be great to learn what different techniques S2 improves upon compared to Snappy.

klauspost · 2023-01-07T11:37:21Z

klauspost
Jan 7, 2023
Maintainer

Do you have any high-level documentation on how S2 compresses data?

Compression formats are defined by their decompression algorithm. S2 is defined by the Snappy format (blocks/frames) with 3 minor changes.

You can find an implementation in the Go decoder.
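To illustrate what "defined by the decompression algorithm" means in practice, here is a minimal sketch of a decoder for the plain Snappy block format that S2 builds on. It is not the S2 decoder and does not handle any of the S2 changes; the function name `decodeSnappyBlock` and the 64 MiB size cap are assumptions made for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errCorrupt = errors.New("corrupt input")

// decodeSnappyBlock is a hypothetical, unoptimized decoder for a plain
// Snappy block: a uvarint with the decoded length, then a series of
// literal and copy elements. S2 extensions are NOT handled.
func decodeSnappyBlock(src []byte) ([]byte, error) {
	n, hdr := binary.Uvarint(src)
	if hdr <= 0 || n > 64<<20 { // the size cap is arbitrary, for this sketch only
		return nil, errCorrupt
	}
	src = src[hdr:]
	dst := make([]byte, 0, n)
	for len(src) > 0 {
		tag := src[0]
		src = src[1:]
		var length, offset int
		switch tag & 3 {
		case 0: // literal: copy bytes straight from the input
			length = int(tag>>2) + 1
			if extra := int(tag>>2) - 59; extra > 0 {
				// Long literal: length-1 is stored in 1 to 4 little-endian bytes.
				if len(src) < extra {
					return nil, errCorrupt
				}
				length = 0
				for i := extra - 1; i >= 0; i-- {
					length = length<<8 | int(src[i])
				}
				length++
				src = src[extra:]
			}
			if len(src) < length {
				return nil, errCorrupt
			}
			dst = append(dst, src[:length]...)
			src = src[length:]
			continue
		case 1: // copy, 11-bit offset, length 4 to 11
			if len(src) < 1 {
				return nil, errCorrupt
			}
			length = 4 + (int(tag>>2) & 7)
			offset = int(tag>>5)<<8 | int(src[0])
			src = src[1:]
		case 2: // copy, 16-bit offset, length 1 to 64
			if len(src) < 2 {
				return nil, errCorrupt
			}
			length = 1 + int(tag>>2)
			offset = int(binary.LittleEndian.Uint16(src))
			src = src[2:]
		case 3: // copy, 32-bit offset, length 1 to 64
			if len(src) < 4 {
				return nil, errCorrupt
			}
			length = 1 + int(tag>>2)
			offset = int(binary.LittleEndian.Uint32(src))
			src = src[4:]
		}
		if offset <= 0 || offset > len(dst) {
			return nil, errCorrupt
		}
		// Byte-by-byte so that overlapping copies (offset < length) work.
		for i := 0; i < length; i++ {
			dst = append(dst, dst[len(dst)-offset])
		}
	}
	if uint64(len(dst)) != n {
		return nil, errCorrupt
	}
	return dst, nil
}

func main() {
	// Hand-built block: length 8, literal "abcd", copy of 4 bytes at offset 4.
	block := []byte{0x08, 0x0c, 'a', 'b', 'c', 'd', 0x01, 0x04}
	out, err := decodeSnappyBlock(block)
	fmt.Printf("%q %v\n", out, err)
}
```

Anything an encoder emits is valid as long as a decoder like this reproduces the input, which is why encoders are free to differ in how hard they search for matches.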

I believe this file is hand-written rather than generated based on the commits being made to it but there aren't any annotations.

The assembler is generated from pseudo-assembler using avo.

S2 is said to detect and skip over data that isn't compressible and the way this is described makes me believe Snappy never had this sort of capability but when I examined Google's Snappy C code I can see there is heuristic match skipping support in their encoder.

I am not entirely sure what you are referring to. When I refer to Snappy, I mean the Go package. I believe the skipping is in there after I convinced the devs a long time ago. I am pretty sure I was also the one who convinced them to add it to the C version, but I can't find that information anywhere. Please point me to where I state that the Go package doesn't have that, since that is outdated.

EDIT: Oh, they mentioned it in the commit.

I suspect this is further enhanced in S2 to the point that it's worth mentioning as its own feature.

It is important because many compression users are hesitant to use compression on already compressed material, since many older compressors become very slow on incompressible input. Here it is no big deal, since the compressor will quickly skip it, so it is important developer information.
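As a rough illustration of why incompressible input is cheap, here is a minimal sketch of an accelerating skip heuristic, modeled on the style used in the Snappy encoders. It only simulates where a matcher would probe its hash table if it never found a match; the function name `probePositions` and the constants are assumptions for this example, and the actual S2 encoders use their own variants.

```go
package main

import "fmt"

// probePositions returns the input offsets at which a hypothetical matcher
// would do a hash lookup over n bytes when no match is ever found.
// The step between lookups starts at 1 byte and grows the longer the
// search goes without success, so long incompressible runs are skipped
// with very few lookups. A real encoder resets the skip after each match.
func probePositions(n int) []int {
	var probes []int
	skip := 32 // assumed starting value; step = skip >> 5 = 1
	for s := 0; s < n; {
		probes = append(probes, s)
		step := skip >> 5
		s += step
		skip += step
	}
	return probes
}

func main() {
	const size = 1 << 16
	p := probePositions(size)
	fmt.Printf("%d lookups for %d bytes of unmatched input\n", len(p), size)
}
```

The first 32 positions are checked one by one, after which the stride grows steadily, so the number of lookups grows far more slowly than the input size.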

To see the current encoders, you can reference the Go versions: default, better, best. The snappy compatible versions are in the same files.


marklit Jan 7, 2023
Author

Please point me to where I state that the Go package doesn't have that, since that is outdated.

This is most likely my misunderstanding of a description I read somewhere back in 2021.

Thank you for the above. I learned a lot.

klauspost Jan 7, 2023
Maintainer

Great!

You are also welcome to look at my sheet, which I try to keep up to date, comparing various Go compression methods on different data types.

snappy is the Go snappy package. s2 is... well... s2, and s2s is s2 with Snappy-compatible output. lz4 is https://github.com/pierrec/lz4
