support decompressing fixed huffman blocks #124

garymm · 2024-01-06T17:58:59Z

support decompressing fixed huffman blocks

Change-Id: I5a30394b46e113595336e89c954e03d3acb120fe

huffman/src/bit_span.hpp

huffman/src/decode.hpp

src/decompress.hpp

codecov · 2024-03-09T15:29:28Z

Codecov Report

Attention: Patch coverage is 65.54622%, with 41 lines in your changes missing coverage. Please review.

Project coverage is 81.56%. Comparing base (1cd6846) to head (b29aaf3).

Files	Patch %	Lines
src/decompress.cpp	73.33%	18 Missing and 10 partials ⚠️
huffman/src/decode.hpp	7.14%	13 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #124      +/-   ##
==========================================
- Coverage   82.40%   81.56%   -0.84%     
==========================================
  Files          16       17       +1     
  Lines         625      754     +129     
  Branches       39       59      +20     
==========================================
+ Hits          515      615     +100     
- Misses         92      113      +21     
- Partials       18       26       +8


huffman/src/bit_span.hpp

huffman/src/decode.hpp

oliverlee · 2024-03-09T19:56:09Z

huffman/src/decode.hpp

-      code_table_pos = code_table.begin();
-      current_code = code{};
-      continue;
+      return {(*found)->symbol, bits_read};


Suggested change

-      return {(*found)->symbol, bits_read};
+      return {(*found)->symbol, (*found)->bitsize()};

although I think it's better to change the return type. decode_result is imbuing "error" semantics in values with encoded_size = 0.

I think it's easier to understand when error semantics are explicitly conveyed with a return type such as optional<encoding> or expected<encoding, error_type?>. The end-iterator pattern is also common enough that it could work here.

Ah yeah the suggested change is cleaner.
As for returning an optional or expected: I agree generally it's cleaner to signal errors explicitly, but:

I want this code to be fast, so having an extra byte of the return type might actually matter.

The type system allows encoded_size to be zero, but that's meaningless. So the type system is sort of forcing us to consider the possibility. May as well use that possibility to signal errors.

My suggestion is to use optional or expected for better semantics as a first pass.

As a second pass (or possibly in parallel with main development), we can replace std::optional with a compressed optional type that uses the "invalid" bit patterns to track if a value exists or does not.

There's a pretty good talk on this pattern here:
https://www.youtube.com/watch?v=MWBfmmg8-Yo

This appears to be a drop-in replacement:
https://github.com/Sedeniono/tiny-optional

I haven't looked at it much and I'm sure there's others as well.

But feel free to close if that's something you want to change later.
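To make the trade-off in this thread concrete, here is a minimal sketch of the compact-optional idea. It is not the PR's code and not the tiny-optional API: `decode_result`'s field types and the `compact_decode_result` wrapper are assumptions for illustration. The empty state lives in the `encoded_size == 0` bit pattern, so the wrapper carries optional-like semantics without an extra flag byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the PR's decode result. The field names follow
// the discussion; the exact types are assumptions.
struct decode_result {
  std::uint16_t symbol{};
  std::uint8_t encoded_size{};  // 0 is never a valid code length
};

// Sketch of a "compact optional": the empty state is stored in the otherwise
// meaningless encoded_size == 0 pattern instead of a separate bool.
class compact_decode_result {
 public:
  constexpr compact_decode_result() = default;  // empty
  constexpr compact_decode_result(decode_result r) : value_(r) {}

  constexpr bool has_value() const { return value_.encoded_size != 0; }
  constexpr explicit operator bool() const { return has_value(); }
  constexpr const decode_result& operator*() const { return value_; }
  constexpr const decode_result* operator->() const { return &value_; }

 private:
  decode_result value_{};
};

// The wrapper adds no storage on top of the wrapped struct.
static_assert(sizeof(compact_decode_result) == sizeof(decode_result),
              "no extra flag byte");
```

Callers then write `if (not result) { ... }` and `result->symbol`, the same shape as with `std::optional`, while the size argument raised above still holds.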

oliverlee · 2024-03-09T20:41:32Z

src/decompress.cpp

+    std::span<std::byte> dst,
+    std::ptrdiff_t& dst_written,


Suggested change

-    std::span<std::byte> dst,
-    std::ptrdiff_t& dst_written,
+    std::span<std::byte>& dst,

src/decompress.cpp

oliverlee · 2024-03-09T20:46:58Z

src/decompress.cpp

+    const auto lit_or_len_decoded = huffman::decode_one(len_table, src_bits);
+    if (not lit_or_len_decoded.encoded_size) {
+      return DecompressStatus::InvalidLitOrLen;
+    }


I think if src_bits doesn't have enough bits to decode a symbol - that would be "success" and this function should exit?

If decode_one is changed to return expected<>, maybe something like this?

    const auto lit_or_len = huffman::decode_one(len_table, src_bits);
    if (not lit_or_len) {
      return status_for(lit_or_len.error());
    }

If src_bits doesn't have enough bits, you do have a partially constructed code and an iterator into the code table. You could potentially keep those in order to skip some computation when more bits arrive, although it's certainly simpler to just restart.

Are you thinking of a chunked / streaming input?
If not, I don't understand why not having enough bits would be success?

Yeah I'm thinking of the chunked/streaming case.

Probably worth adding a TODO to handle later.
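For the streaming case discussed here, one possible shape for `status_for` is sketched below. Everything in it is an assumption for illustration: the `DecodeError` enumerators, the `NeedMoreInput` status, and the use of `std::variant` as a C++17 stand-in for `expected<>` are not taken from the PR. The point is only that "ran out of bits" and "bits do not match any code" map to different statuses.

```cpp
#include <cassert>
#include <cstdint>
#include <variant>

// Hypothetical error type for decode_one.
enum class DecodeError { NotEnoughBits, InvalidCode };

// InvalidLitOrLen appears in the PR; NeedMoreInput is an assumed addition
// for chunked input.
enum class DecompressStatus { Success, NeedMoreInput, InvalidLitOrLen };

struct Decoded {
  std::uint16_t symbol;
  std::uint8_t bitsize;
};

// std::variant used as a stand-in for expected<Decoded, DecodeError>.
using DecodeResult = std::variant<Decoded, DecodeError>;

constexpr DecompressStatus status_for(DecodeError e) {
  switch (e) {
    case DecodeError::NotEnoughBits:
      return DecompressStatus::NeedMoreInput;  // caller may retry with more data
    case DecodeError::InvalidCode:
      return DecompressStatus::InvalidLitOrLen;
  }
  return DecompressStatus::InvalidLitOrLen;  // unreachable for valid enumerators
}

// Maps one decode attempt to the status the decompress loop would return.
inline DecompressStatus handle(const DecodeResult& r) {
  if (const auto* err = std::get_if<DecodeError>(&r)) {
    return status_for(*err);
  }
  return DecompressStatus::Success;
}
```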

src/decompress.cpp

Change-Id: I5a30394b46e113595336e89c954e03d3acb120fe

Change-Id: I9322af2674ab583f3cdb286ea06385587bacf670

Change-Id: Ib2a9cccc5df211363b2083b01d46043b329cd025

Change-Id: I5ec6e616b5ec5d33fb221e196fd9976702175bf8

Change-Id: Ib720fb2b200a72d54bae4246a42524ec9b1444b7

garymm

Thanks for the review. Let's talk on the phone or in person if there's still confusion or disagreement.

src/decompress.hpp

garymm · 2024-03-10T17:23:50Z

src/decompress.hpp

+///  X and Y, a string reference with <length = 5, distance = 2>
+///  adds X,Y,X,Y,X to the output stream."
+void copy_n(
+    std::span<const std::byte>::iterator src,


I changed the signature and I documented the precondition.
I used these types because:

This function is called in only one place and these types are convenient. I would inline this function but I broke it out so I can test it separately.

Isn't iterator arithmetic preferred over pointer arithmetic?

Do you still think there's a reason to use byte*?
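The overlap rule quoted in the doc comment (<length = 5, distance = 2> after X, Y yields X,Y,X,Y,X) can be sketched as follows. This is an illustration with an assumed signature, using a growing `std::vector` rather than the span iterators in the PR; the name `copy_from_before` is borrowed from the later commit title, but the body is not the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends `len` bytes to `out`, each copied from `distance` bytes before the
// current end of `out`.
// Precondition: 0 < distance <= out.size().
// The copy is byte-by-byte on purpose: when len > distance the source range
// overlaps the bytes being written, so earlier copies feed later ones.
inline void copy_from_before(std::vector<std::byte>& out,
                             std::size_t distance,
                             std::size_t len) {
  assert(distance > 0 && distance <= out.size());
  out.reserve(out.size() + len);  // avoid reallocation during the loop
  for (std::size_t i = 0; i < len; ++i) {
    out.push_back(out[out.size() - distance]);
  }
}
```

Note that `std::copy_n` gives no guarantee for overlapping ranges, which is one reason to keep this as a separate, separately tested helper.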

src/decompress.cpp

garymm · 2024-03-10T19:06:22Z

src/decompress.cpp

        return DecompressStatus::DstTooSmall;
      }

-      std::copy_n(src_bits.byte_data(), len, dst.begin());
+      std::copy_n(src_bits.byte_data(), len, dst.begin() + dst_written);


If we advance dst, my guess would be it's UB to do dst.begin() - distance. I need to go backwards. If you have a better suggestion LMK.
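One way to keep backward references in range without advancing `dst` is a small wrapper that fixes the start of the buffer and tracks a write cursor. The sketch below is hypothetical (the class name echoes the `destination_buffer` suggestion elsewhere in this review, but none of these members come from the PR) and uses a raw pointer plus size instead of `std::span` to stay C++17-compatible.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper: the buffer start never moves, so looking `distance`
// bytes behind the cursor is an in-range access whenever
// distance <= written().
class destination_buffer {
 public:
  destination_buffer(std::byte* data, std::size_t size)
      : data_{data}, size_{size} {}

  std::size_t written() const { return written_; }
  std::size_t remaining() const { return size_ - written_; }

  // Appends one literal byte; returns false if the buffer is full.
  bool push(std::byte b) {
    if (remaining() == 0) {
      return false;
    }
    data_[written_++] = b;
    return true;
  }

  // Pointer to the byte `distance` positions behind the write cursor, or
  // nullptr if that would fall before the start of the buffer.
  const std::byte* before(std::size_t distance) const {
    if (distance == 0 || distance > written_) {
      return nullptr;
    }
    return data_ + (written_ - distance);
  }

 private:
  std::byte* data_;
  std::size_t size_;
  std::size_t written_{0};
};
```

With this shape the `dst.size() - dst_written < len` check becomes `buf.remaining() < len`, and the signed/unsigned cast goes away.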

huffman/src/bit_span.hpp

huffman/src/decode.hpp

garymm · 2024-03-10T22:09:29Z

huffman/src/decode.hpp

-      code_table_pos = code_table.begin();
-      current_code = code{};
-      continue;
+      return {(*found)->symbol, bits_read};


Ah yeah the suggested change is cleaner.
As for returning an optional or expected: I agree generally it's cleaner to signal errors explicitly, but:

I want this code to be fast, so having an extra byte of the return type might actually matter.

The type system allows encoded_size to be zero, but that's meaningless. So the type system is sort of forcing us to consider the possibility. May as well use that possibility to signal errors.

garymm · 2024-03-10T22:18:21Z

src/decompress.cpp

+    const auto lit_or_len_decoded = huffman::decode_one(len_table, src_bits);
+    if (not lit_or_len_decoded.encoded_size) {
+      return DecompressStatus::InvalidLitOrLen;
+    }


Are you thinking of a chunked / streaming input?
If not, I don't understand why not having enough bits would be success?

src/decompress.cpp

garymm · 2024-03-10T22:20:12Z

src/decompress.cpp

@@ -58,22 +179,27 @@ auto decompress(std::span<const std::byte> src, std::span<std::byte> dst)
        return DecompressStatus::SrcTooSmall;
      }

-      if (dst.size() < len) {
+      if (dst.size() - static_cast<std::size_t>(dst_written) < len) {


IIUC this comment is irrelevant unless we drop the prefix of dst every time as suggested below.

Change-Id: I5c8a4f4c56f25ebe37a21caea97312309e19729c

oliverlee · 2024-03-24T02:58:05Z

src/decompress.cpp

+  std::uint16_t len{};
+  if (lit_or_len == detail::lit_or_len_max) {
+    len = detail::lit_or_len_max_decoded;
+  } else {


Seems like you can just return instead of construction + assignment of len?

    if (lit_or_len == detail::lit_or_len_max) {
      return detail::lit_or_len_max_decoded;
    }
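A self-contained sketch of the early-return shape, for illustration only. It assumes that `lit_or_len_max` is 285 and `lit_or_len_max_decoded` is 258, which matches the length table in RFC 1951 section 3.2.5 but is not confirmed by the diff shown here. It also takes the already-read extra bits as a parameter, whereas the PR reads them from `src_bits`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed values, taken from RFC 1951 rather than from the PR.
constexpr std::uint16_t lit_or_len_max = 285;
constexpr std::uint16_t lit_or_len_max_decoded = 258;

// Base match length for length symbols 257..285 (RFC 1951, section 3.2.5).
constexpr std::array<std::uint16_t, 29> len_base{
    3,  4,  5,  6,  7,  8,  9,  10, 11,  13,  15,  17,  19,  23, 27,
    31, 35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258};

// Precondition: 257 <= symbol <= 285, and `extra` is the value of the extra
// bits that belong to this symbol.
constexpr std::uint16_t decode_len(std::uint16_t symbol, std::uint16_t extra) {
  if (symbol == lit_or_len_max) {
    return lit_or_len_max_decoded;  // early return, no mutable local needed
  }
  return static_cast<std::uint16_t>(len_base[symbol - 257u] + extra);
}
```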

oliverlee · 2024-03-24T03:24:45Z

src/decompress.cpp

-      break;
-    }
-    if (lit_or_len > detail::lit_or_len_max) {
+    const auto maybe_lit_or_len =


This looks better to me.

I think this could still benefit from breaking out smaller functions - although if you think that's too much work, feel free to ignore.

    const auto done = parse_lit_or_len(lit_or_len_decoded.symbol, src_bits)
                          .and_then(decode_lit_or_len_into(dst))
                          .or_else(convert_error_status);
    if (done) {
      return *done;
    }
    // done is empty, so the loop continues

where convert_error_status: ParseLitOrLenStatus -> optional<DecompressStatus>

return {in_place, e == EndOfBlock ? Success : InvalidLitOrLen};

and decode_lit_or_len_into uses the poor version of pattern matching available in C++:

    auto decode_lit_or_len_into(std::span<std::byte>& dst) {
      return [&dst](variant<std::byte, std::uint16_t> lit_or_len) {
        return visit(
            overloaded{
                [&dst](byte lit) { return decode_into(lit, dst); },
                [&dst](uint16_t len) { return decode_into(len, dst); }},
            lit_or_len);
      };
    }

    // https://en.cppreference.com/w/cpp/utility/variant/visit
    template <class... Ts>
    struct overloaded : Ts... { using Ts::operator()...; };

and

    auto decode_into(byte lit, span& dst) -> optional<DecompressStatus> {
      // lines 133-137
    }

    auto decode_into(uint16_t len, span& dst) -> optional<DecompressStatus> {
      // lines 139-161
    }

although that's a bit of a simplification - you'll also need to capture dst_written.

The different cases are broken out into different functions, although there's some logic inversion with variant::visit.
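A compilable version of the visitor idea, as a rough sketch under stated assumptions: the `Output` struct, its `capacity` and `pending_len` fields, and the reduced `DecompressStatus` enum are all hypothetical, and the length case only records the length because the distance still has to be decoded. It shows the `overloaded` helper dispatching on the variant alternative, with an empty optional meaning "keep looping".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <variant>
#include <vector>

enum class DecompressStatus { Success, DstTooSmall };

template <class... Ts>
struct overloaded : Ts... {
  using Ts::operator()...;
};
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;  // deduction guide, needed in C++17

using LitOrLen = std::variant<std::byte, std::uint16_t>;

// Hypothetical output state for this sketch.
struct Output {
  std::vector<std::byte> bytes;
  std::size_t capacity{};
  std::uint16_t pending_len{};  // length waiting for its distance code
};

// Returns a status only when decoding must stop; empty means "continue".
inline std::optional<DecompressStatus> decode_lit_or_len_into(
    Output& out, const LitOrLen& lit_or_len) {
  return std::visit(
      overloaded{
          [&out](std::byte lit) -> std::optional<DecompressStatus> {
            if (out.bytes.size() == out.capacity) {
              return DecompressStatus::DstTooSmall;
            }
            out.bytes.push_back(lit);
            return std::nullopt;
          },
          [&out](std::uint16_t len) -> std::optional<DecompressStatus> {
            out.pending_len = len;
            return std::nullopt;
          }},
      lit_or_len);
}
```

Because `std::byte` is a scoped enumeration, neither lambda is viable for the other alternative, so overload resolution is unambiguous.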

oliverlee

Some suggestions on refactoring the parsing function - I think a destination_buffer class could simplify that a bit as well.

oliverlee reviewed Jan 6, 2024

garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 0386453 to ddaa091 Compare February 11, 2024 23:53

garymm changed the base branch from Ie061dff47a33fdc8995770c7a5f98129b8788556 to master February 11, 2024 23:53

garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch 3 times, most recently from ea4ce7c to 206623b Compare February 12, 2024 15:33

garymm changed the base branch from master to I745f447fb22a843257d1ae211a130cd39dad4ccc February 12, 2024 15:33

garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 206623b to 85dd9ef Compare March 9, 2024 15:10

garymm changed the title ~~WIP fixed huffman block type~~ support decompressing fixed huffman blocks Mar 9, 2024

garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 85dd9ef to ed089c3 Compare March 9, 2024 15:22

garymm force-pushed the I745f447fb22a843257d1ae211a130cd39dad4ccc branch from 08cec2e to ae388d3 Compare March 9, 2024 15:22

oliverlee reviewed Mar 9, 2024

garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from ed089c3 to 9b8d407 Compare March 10, 2024 00:10

garymm force-pushed the I745f447fb22a843257d1ae211a130cd39dad4ccc branch from ae388d3 to cb99492 Compare March 10, 2024 00:10

garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 9b8d407 to 75b4fc2 Compare March 10, 2024 00:18

garymm force-pushed the I745f447fb22a843257d1ae211a130cd39dad4ccc branch from cb99492 to 8783255 Compare March 10, 2024 00:18

garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 75b4fc2 to 1b7d011 Compare March 10, 2024 00:27

garymm force-pushed the I745f447fb22a843257d1ae211a130cd39dad4ccc branch 2 times, most recently from cdd7b97 to 33fc1b4 Compare March 10, 2024 00:35

garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 1b7d011 to 2488e0e Compare March 10, 2024 00:35

garymm marked this pull request as ready for review March 10, 2024 00:36

garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 2488e0e to d3c35c5 Compare March 10, 2024 00:45

garymm force-pushed the I745f447fb22a843257d1ae211a130cd39dad4ccc branch from 33fc1b4 to e18ab12 Compare March 10, 2024 00:45

Base automatically changed from I745f447fb22a843257d1ae211a130cd39dad4ccc to master March 10, 2024 00:50

garymm added 4 commits March 10, 2024 17:02

support decompressing fixed huffman blocks

6943db7

Change-Id: I5a30394b46e113595336e89c954e03d3acb120fe

copy_n to copy_from_before

6983463

Change-Id: I9322af2674ab583f3cdb286ea06385587bacf670

extract parse_lit_or_len

f5faa04

Change-Id: Ib2a9cccc5df211363b2083b01d46043b329cd025

move pop_n to decompress.cpp

58f3d2e

Change-Id: I5ec6e616b5ec5d33fb221e196fd9976702175bf8

garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from d3c35c5 to 58f3d2e Compare March 10, 2024 20:03

use bitsize(), make things private again

c4416b7

Change-Id: Ib720fb2b200a72d54bae4246a42524ec9b1444b7

garymm commented Mar 10, 2024

move things from header to .cpp to make gcc happy

b29aaf3

Change-Id: I5c8a4f4c56f25ebe37a21caea97312309e19729c

oliverlee reviewed Mar 24, 2024

oliverlee approved these changes Mar 24, 2024
