ZstdFrameCompressor with input buffer #53
Codecov Report
Coverage diff, master vs. #53:
             master      #53      +/-
Coverage     33.01%   38.73%   +5.71%
Files             5        6       +1
Lines           524      599      +75
Hits            173      232      +59
Misses          351      367      +16
I don't see a major slowdown from the added copy.
julia> d = rand(0x01:0x09, 100000);
julia> @btime transcode(ZstdCompressor, $(d));
  499.738 μs (8 allocations: 97.94 KiB)
julia> @btime transcode(ZstdFrameCompressor, $(d));
  534.875 μs (12 allocations: 196.16 KiB)
julia> d = rand(0x01:0x09, 1000);
julia> @btime transcode(ZstdCompressor, $(d));
  11.001 μs (7 allocations: 1.27 KiB)
julia> @btime transcode(ZstdFrameCompressor, $(d));
  4.494 μs (11 allocations: 2.50 KiB)
#52, unlike #46, actually does deal with additional data in the stream beyond the initial data. What #52 does is create and save multiple frames. Each batch of new data just gets saved as an individual frame, each knowing exactly the size of the corresponding decompressed data. Each frame is then appended to the output stream.
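To make the multi-frame idea concrete, here is a minimal sketch in plain Julia. It is not the code from #52 and not the Zstandard frame format: `write_frame!` and `read_frames` are hypothetical helpers, the "frame" is just an assumed 8-byte little-endian size followed by the uncompressed payload, and no compression happens. It only illustrates that each batch can become an independent frame that records its own decompressed size, and that frames can be appended one after another and read back in sequence.

```julia
# Toy frame: 8-byte little-endian payload size, then the payload itself.
# This is a stand-in for illustration only, NOT the Zstandard frame format.
function write_frame!(out::Vector{UInt8}, chunk::Vector{UInt8})
    # Record the size of this batch so a reader knows how much to expect.
    append!(out, reinterpret(UInt8, [htol(UInt64(length(chunk)))]))
    append!(out, chunk)
    return out
end

# Read every frame in `data` and concatenate the payloads.
function read_frames(data::Vector{UInt8})
    out = UInt8[]
    pos = 1
    while pos <= length(data)
        n = Int(ltoh(reinterpret(UInt64, data[pos:pos+7])[1]))
        append!(out, data[pos+8:pos+7+n])
        pos += 8 + n
    end
    return out
end

stream = UInt8[]
write_frame!(stream, UInt8[1, 2, 3])   # first batch becomes its own frame
write_frame!(stream, UInt8[4, 5])      # later data is appended as a second frame
restored = read_frames(stream)
```

Because every frame is self-describing, the writer never needs to know in advance how many batches will arrive.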
Can you establish that the way JLD2.jl uses
By buffering here, a single frame is created, which may have some advantages but may not be necessary for the JLD2.jl case.
I'm also unsure whether this buffering needs to be specific to Zstandard; it could be useful for other codecs as well. My suggestion is that we consider abstracting this code out into its own codec, and then figure out how to compose it with other codecs in a generic way. I strongly suspect that we will need this for LZ4 compression when we get around to fixing LZ4.jl interoperability with HDF5. The problem is that the H5lz4 codec invents its own frame format, saving the decompressed size and the block size before compressing. It also saves the size of each compressed block.
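As a rough illustration of the kind of block-framed layout described for H5lz4, here is a sketch in plain Julia. It is not the H5lz4 implementation: `frame_blocks`, `put_be!`, and `compress_block` are hypothetical names, the field widths (UInt64 for the total size, UInt32 for the block sizes) and the big-endian byte order are assumptions, and `compress_block` is an identity placeholder where a real codec would call LZ4.

```julia
# Assumed layout:
#   [total decompressed size][block size][size of block 1][block 1][size of block 2][block 2]...
# Placeholder "compressor": returns the block unchanged.
compress_block(block::AbstractVector{UInt8}) = collect(block)

# Append the big-endian bytes of an unsigned integer to `out`.
put_be!(out::Vector{UInt8}, x::Unsigned) = append!(out, reinterpret(UInt8, [hton(x)]))

function frame_blocks(data::Vector{UInt8}, blocksize::Integer)
    out = UInt8[]
    put_be!(out, UInt64(length(data)))   # decompressed size, known only once all data is in hand
    put_be!(out, UInt32(blocksize))      # block size used for splitting
    for start in 1:blocksize:length(data)
        stop = min(start + blocksize - 1, length(data))
        block = compress_block(view(data, start:stop))
        put_be!(out, UInt32(length(block)))  # size of this compressed block
        append!(out, block)
    end
    return out
end

framed = frame_blocks(UInt8[1, 2, 3, 4, 5], 2)
```

The header needs the total decompressed size before the first block is written, which is why a codec like this cannot stream without buffering its input first.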
Yes, both this PR and #52 should work fine for JLD2. Adding the buffer just makes this codec more consistent with
In comparison to #52, I would consider renaming this "ZstdSingleFrameCompressor", which may still be useful for compatibility with erroneous decompression routines that incorrectly determine the decompressed size.
I'm going to close this for now, since #52 seems like a better solution for JLD2. I agree that most of the state management code here could be abstracted out, and each non-streaming compressor like H5LZ4 or Blosc would just need to implement something like
* Implement ZstdFrameCompressor via endOp
* Repeat calling compress! with same input until code == 0 with ZSTD_e_end
* Adopt additional tests from #53
* Allocate an input buffer when using ZstdFrameCompressor
* Simplify, remove buffer, just keep ibuffer pos and size same to complete frame
* Reset input and output buffers of Cstream on initialize and finalize
* Reset buffers on decompression
An alternative to #52 and #46
An input buffer is needed because if input.size is zero, the process function has no way of knowing if all the input is available, or if process will be called again with new input that must be appended to the current frame.
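The buffering approach can be sketched in plain Julia as follows. This is not the code from this PR: `BufferingFramer` and `process!` are hypothetical names, the output is a toy frame (an assumed 8-byte little-endian size plus the raw bytes, with no compression), and the sketch assumes that an empty input chunk is the only signal that no more data will arrive. Under that assumption, a codec that must write the total size at the start of a single frame has to hold every chunk until the end signal.

```julia
# Holds input until the end of the data is signalled.
struct BufferingFramer
    pending::Vector{UInt8}
end
BufferingFramer() = BufferingFramer(UInt8[])

# Returns the bytes emitted for this call (possibly none).
function process!(f::BufferingFramer, input::Vector{UInt8})
    if !isempty(input)
        # More data may still follow, so the frame cannot be completed yet.
        append!(f.pending, input)
        return UInt8[]
    end
    # Empty input: assumed to mean end of data. Emit one frame for everything buffered.
    frame = UInt8[]
    append!(frame, reinterpret(UInt8, [htol(UInt64(length(f.pending)))]))
    append!(frame, f.pending)
    empty!(f.pending)
    return frame
end

f = BufferingFramer()
first_out = process!(f, UInt8[1, 2])   # buffered, nothing emitted
second_out = process!(f, UInt8[3])     # buffered, nothing emitted
frame = process!(f, UInt8[])           # end signal: single frame emitted
```

The cost is the extra copy into the buffer, which is the overhead measured in the benchmark earlier in this conversation.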