-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support decompressing fixed huffman blocks #124
base: master
Are you sure you want to change the base?
Conversation
0386453
to
ddaa091
Compare
ea4ce7c
to
206623b
Compare
206623b
to
85dd9ef
Compare
85dd9ef
to
ed089c3
Compare
08cec2e
to
ae388d3
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #124 +/- ##
==========================================
- Coverage 82.40% 81.56% -0.84%
==========================================
Files 16 17 +1
Lines 625 754 +129
Branches 39 59 +20
==========================================
+ Hits 515 615 +100
- Misses 92 113 +21
- Partials 18 26 +8 ☔ View full report in Codecov by Sentry. |
huffman/src/decode.hpp
Outdated
code_table_pos = code_table.begin(); | ||
current_code = code{}; | ||
continue; | ||
return {(*found)->symbol, bits_read}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return {(*found)->symbol, bits_read}; | |
return {(*found)->symbol, (*found)->bitsize()}; |
although I think it's better to change the return type. decode_result
is imbuing "error" semantics in values with encoded_size = 0
.
I think it's easier to understand when error semantics are explicitly conveyed with a return type such as optional<encoding>
or expected<encoding, error_type?>
. Although the end-iterator pattern is also common enough that could also work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah the suggested change is cleaner.
As for returning an optional or expected: I agree generally it's cleaner to signal errors explicitly, but:
- I want this code to be fast, so having an extra byte of the return type might actually matter.
- The type system allows encoded_size to be zero, but that's meaningless. So the type system is sort of forcing us to consider the possibility. May as well use that possibility to signal errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to use optional
or expected
for better semantics as a first pass.
As a second pass (or possibly in parallel with main development), we can replace std::optional
with a compressed optional type that uses the "invalid" bit patterns to track if a value exists or does not.
There's a pretty good talk on this pattern here:
https://www.youtube.com/watch?v=MWBfmmg8-Yo
This appears to be a drop-in replacement:
https://github.com/Sedeniono/tiny-optional
I haven't looked at it much and I'm sure there's others as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But feel free to close if that's something you want to change later.
std::span<std::byte> dst, | ||
std::ptrdiff_t& dst_written, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::span<std::byte> dst, | |
std::ptrdiff_t& dst_written, | |
std::span<std::byte>& dst, |
const auto lit_or_len_decoded = huffman::decode_one(len_table, src_bits); | ||
if (not lit_or_len_decoded.encoded_size) { | ||
return DecompressStatus::InvalidLitOrLen; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if src_bits
doesn't have enough bits to decode a symbol - that would be "success" and this function should exit?
If decode_one
is changed to return expected<>
, maybe something like this?
const auto lit_or_len = huffman::decode_one(len_table, src_bits);
if (not lit_or_len) {
return status_for(lit_or_len.error());
}
If src_bits
doesn't have enough bits, you do have have a partially constructed code and iterator in the code table. You could potentially keep those in order to skip some computation when more bits arrive - although it's certainly simpler to just restart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking of a chunked / streaming input?
If not, I don't understand why not having enough bits would be success?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm thinking of the chunked/streaming case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth adding a TODO to handle later.
ed089c3
to
9b8d407
Compare
ae388d3
to
cb99492
Compare
9b8d407
to
75b4fc2
Compare
cb99492
to
8783255
Compare
75b4fc2
to
1b7d011
Compare
cdd7b97
to
33fc1b4
Compare
1b7d011
to
2488e0e
Compare
2488e0e
to
d3c35c5
Compare
33fc1b4
to
e18ab12
Compare
Change-Id: I5a30394b46e113595336e89c954e03d3acb120fe
Change-Id: I9322af2674ab583f3cdb286ea06385587bacf670
Change-Id: Ib2a9cccc5df211363b2083b01d46043b329cd025
Change-Id: I5ec6e616b5ec5d33fb221e196fd9976702175bf8
d3c35c5
to
58f3d2e
Compare
Change-Id: Ib720fb2b200a72d54bae4246a42524ec9b1444b7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. Let's talk on the phone or in person if there's still confusion or disagreement.
src/decompress.hpp
Outdated
/// X and Y, a string reference with <length = 5, distance = 2> | ||
/// adds X,Y,X,Y,X to the output stream." | ||
void copy_n( | ||
std::span<const std::byte>::iterator src, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the signature and I documented the precondition.
I used these types becuse:
- This function is called in only one place and these types are convenient. I would inline this function but I broke it out so I can test it separately.
- Isn't iterator arithmetic preferred over pointer arithmetic?
Do you still think there's a reason to use byte*?
return DecompressStatus::DstTooSmall; | ||
} | ||
|
||
std::copy_n(src_bits.byte_data(), len, dst.begin()); | ||
std::copy_n(src_bits.byte_data(), len, dst.begin() + dst_written); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we advance dst, my guess would be it's UB to do dst.begin() - distance
. I need to go backwards. If you have a better suggestion LMK.
huffman/src/decode.hpp
Outdated
code_table_pos = code_table.begin(); | ||
current_code = code{}; | ||
continue; | ||
return {(*found)->symbol, bits_read}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah the suggested change is cleaner.
As for returning an optional or expected: I agree generally it's cleaner to signal errors explicitly, but:
- I want this code to be fast, so having an extra byte of the return type might actually matter.
- The type system allows encoded_size to be zero, but that's meaningless. So the type system is sort of forcing us to consider the possibility. May as well use that possibility to signal errors.
const auto lit_or_len_decoded = huffman::decode_one(len_table, src_bits); | ||
if (not lit_or_len_decoded.encoded_size) { | ||
return DecompressStatus::InvalidLitOrLen; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking of a chunked / streaming input?
If not, I don't understand why not having enough bits would be success?
@@ -58,22 +179,27 @@ auto decompress(std::span<const std::byte> src, std::span<std::byte> dst) | |||
return DecompressStatus::SrcTooSmall; | |||
} | |||
|
|||
if (dst.size() < len) { | |||
if (dst.size() - static_cast<std::size_t>(dst_written) < len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this comment is irrelevant unless we drpo the prefix of dst every time as suggested below.
Change-Id: I5c8a4f4c56f25ebe37a21caea97312309e19729c
std::uint16_t len{}; | ||
if (lit_or_len == detail::lit_or_len_max) { | ||
len = detail::lit_or_len_max_decoded; | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like you can just return instead of construction + assignment of len
?
if (lit_or_len == detail::lit_or_len_max) {
return detail::lit_or_len_max_decoded;
}
break; | ||
} | ||
if (lit_or_len > detail::lit_or_len_max) { | ||
const auto maybe_lit_or_len = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks better to me.
I think this could still benefit from breaking out smaller functions - although if you think that's too much work, feel free to ignore.
const auto done =
parse_lit_or_len(lit_or_len_decoded.symbol, src_bits)
.and_then(decode_lit_or_len_into(dst))
.or_else(convert_error_status)
if (done) {
return *done;
}
// done is empty, so the the loop continues
where convert_error_status: ParseLitOrLenStatus -> optional<DecompressStatus>
return {in_place, e == EndOfBlock ? Success : InvalidLitOrLen};
and decode_lit_or_len_into
uses the poor version of pattern matching available in C++:
auto decode_lit_or_len_into(std::span<std::byte>& dst)
{
return [&dst](variant<std::byte, std::uint16_t> lit_or_len) {
return visit(
overloaded({
[&dst](byte lit) { return decode_into(lit, dst); },
[&dst](uint16_t len) { return decode_into(lit, dst); }
}),
lit_or_len
);
};
}
// https://en.cppreference.com/w/cpp/utility/variant/visit
template <class... Ts>
struct overloaded : Ts... { using Ts::operator()...; };
and
auto decode_into(byte lit, span& dst)
-> optional<DecompressStatus>
{
// lines 133-137
}
auto decode_into(uint16_t len, span& dst)
-> optional<DecompressStatus>
{
// lines 139-161
}
although that's a bit of a simplification - you'll also need to capture dst_written
.
The different cases are broken out into different functions, although there's some logic inversion with variant::visit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions on refactoring the parsing function - I think a destination_buffer
class could simplify that a bit as well.
support decompressing fixed huffman blocks
Change-Id: I5a30394b46e113595336e89c954e03d3acb120fe