ZstdFrameCompressor with input buffer #53

nhz2 · 2024-05-25T20:21:29Z

An alternative to #52 and #46

The reason an input buffer is needed is because if input.size is zero, process function has no way of knowing if all the input is available, or if process will be called again with new input that must be appended to the current frame.

codecov · 2024-05-25T20:22:30Z

Codecov Report

Attention: Patch coverage is 76.00000% with 18 lines in your changes are missing coverage. Please review.

Project coverage is 38.73%. Comparing base (ad7288a) to head (e617c5d).

Files	Patch %	Lines
src/framecompressor.jl	76.00%	18 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #53      +/-   ##
==========================================
+ Coverage   33.01%   38.73%   +5.71%     
==========================================
  Files           5        6       +1     
  Lines         524      599      +75     
==========================================
+ Hits          173      232      +59     
- Misses        351      367      +16

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

nhz2 · 2024-05-25T20:46:36Z

I don't see a major slowdown from the added copy.

julia> d = rand(0x01:0x09, 100000);

julia> @btime transcode(ZstdCompressor, $(d));
  499.738 μs (8 allocations: 97.94 KiB)

julia> @btime transcode(ZstdFrameCompressor, $(d));
  534.875 μs (12 allocations: 196.16 KiB)

julia> d = rand(0x01:0x09, 1000);

julia> @btime transcode(ZstdCompressor, $(d));
  11.001 μs (7 allocations: 1.27 KiB)

julia> @btime transcode(ZstdFrameCompressor, $(d));
  4.494 μs (11 allocations: 2.50 KiB)

mkitti · 2024-05-26T03:49:16Z

#52, unlike #46, actually does deal with additional data in the stream beyond the initial data. What #52 does is create and save multiple frames. Each batch of new data just gets saved as an individual frame, each knowing exactly the size of the corresponding decompressed data. Each frame is then appended to the output stream.

ZSTD_e_end=2       /* flush any remaining data _and_ close current frame.
                   * note that frame is only closed after compressed data is fully flushed (return value == 0).
                  * After that point, any additional data starts a new frame.
                   * note : each frame is independent (does not reference any content from previous frame).
                   : note : multithreaded compression will block to flush as much output as possible. */

Can you establish that the way JLD2.jl uses transcode would result in multiple frames under #52 ? Furthermore, I see applications where it may make sense to stream multiple frames with known decompresssed sizes.

https://github.com/JuliaIO/JLD2.jl/blob/98949ead79ff2875cbfb5fa92517b8ead986dd43/src/compression.jl#L197-L209

By buffering here, a single frame is created, which may have some advantages, but may also not be necessary for the JLD2.jl case. I'm also unsure if this buffering needs to be necessarily specific to Zstandard. It could be useful for other codecs as well.

My suggestion for this code is that we consider abstracting it out into its own codec, and then figure out how to compose it with other codecs in a generic way. I highly suspect that we will need this for LZ4 compression when we get around to fixing LZ4.jl interoperability with HDF5.

The problem is that the H5lz4 codec invents its own frame format saving the decompressed size and the block size before compressing. It also saves the size of each compressed block.
https://github.com/HDFGroup/hdf5_plugins/blob/eb77e73ffee9788dde316bc9adba8247126c67d5/LZ4/src/H5Zlz4.c#L207

nhz2 · 2024-05-26T17:50:19Z

Yes, both this PR and #52 should work fine for JLD2. Adding the buffer just makes this codec more consistent with ZstdCompressor when used for streaming. With this PR to create multiple frames, TOKEN_END can be written to explicitly end a frame.

mkitti · 2024-05-28T15:46:34Z

In comparison to #52, I would consider renaming this "ZstdSingleFrameCompressor" which may still have utility for compatibility with errorneous decompression routines that incorrectly determine the decompressed size.

nhz2 · 2024-05-29T00:00:03Z

I'm going to close this for now, since #52 seems like a better solution for JLD2. I agree that most of the state management code here could be abstracted out, and each non streaming compressor like H5LZ4 or Blosc would just need to implement something like compress!(codec, out_memory, in_memory) -> actual out size and compress_bound(codec, in_size) -> max out size methods.

* Implement ZstdFrameCompressor via endOp * Repeat calling compress! with same input until code == 0 with ZSTD_e_end * Adopt additional tests from #53 * Allocate an input buffer when using ZstdFrameCompressor * Simplify, remove buffer, just keep ibuffer pos and size same to complete frame * Reset input and output buffers of Cstream on initialize and finalize * Reset buffers on decompression

frame compressor with buffer

03a5219

nhz2 requested a review from mkitti May 25, 2024 20:21

add missing GC.@preserve

dcf3615

add test for streaming edge case

e617c5d

mkitti added a commit to mkitti/CodecZstd.jl that referenced this pull request May 28, 2024

Adopt additional tests from JuliaIO#53

6f594b7

nhz2 closed this May 29, 2024

nhz2 deleted the nz/buffered-frame-compressor branch September 8, 2024 21:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ZstdFrameCompressor with input buffer #53

ZstdFrameCompressor with input buffer #53

nhz2 commented May 25, 2024

codecov bot commented May 25, 2024 •

edited

Loading

nhz2 commented May 25, 2024

mkitti commented May 26, 2024

nhz2 commented May 26, 2024

mkitti commented May 28, 2024 •

edited

Loading

nhz2 commented May 29, 2024 •

edited

Loading

ZstdFrameCompressor with input buffer #53

ZstdFrameCompressor with input buffer #53

Conversation

nhz2 commented May 25, 2024

codecov bot commented May 25, 2024 • edited Loading

Codecov Report

nhz2 commented May 25, 2024

mkitti commented May 26, 2024

nhz2 commented May 26, 2024

mkitti commented May 28, 2024 • edited Loading

nhz2 commented May 29, 2024 • edited Loading

codecov bot commented May 25, 2024 •

edited

Loading

mkitti commented May 28, 2024 •

edited

Loading

nhz2 commented May 29, 2024 •

edited

Loading