
feat: Add unified compression API and lz4_frame/lz4_raw/lz4_hadoop codec #7589

Open · wants to merge 16 commits into main

Conversation

marin-ma (Contributor)

Initial implementation of the proposed unified compression API. This patch defines the compression Codec API, inspired by Apache Arrow, and adds missing functions used in Velox. It adds support for the codecs LZ4_FRAME, LZ4_RAW, and LZ4_HADOOP, and includes unit tests.

Discussion: #7471

netlify bot commented Nov 15, 2023

Deploy Preview for meta-velox canceled.

Latest commit: 0e886a4
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/662a23924b9cec0008a467bf

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 15, 2023
@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch 4 times, most recently from 70fd61a to 26dbdea Compare November 16, 2023 01:45
@marin-ma (Contributor, Author)

@mbasmanova Could you help review this? Thanks!

@FelixYBW (Contributor)

Resolved review threads (outdated):
velox/common/compression/v2/Compression.h
velox/common/compression/v2/HadoopCompressionFormat.cpp (3 threads)
velox/common/compression/v2/Lz4Compression.cpp

ret = LZ4F_createCompressionContext(&ctx_, LZ4F_VERSION);
if (LZ4F_isError(ret)) {
lz4Error(ret, "LZ4 init failed: ");
Collaborator:

Is any content missing from the error message?

Contributor (Author):

lz4Error will try to expand the error message for the return code. Switched the param order to make it clear for the reader.

Resolved review threads (outdated):
velox/common/compression/v2/Lz4Compression.cpp (2 threads)
velox/common/compression/v2/tests/CompressionTest.cpp
@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 60d52b2 to 6f462b9 Compare November 17, 2023 01:56
@marin-ma (Contributor, Author)

@mbasmanova Could you help review this? Thanks!

@yaqi-zhao (Contributor)

The asynchronous compression API is not in this PR, right?

@marin-ma (Contributor, Author)

The asynchronous compression API is not in this PR, right?

@yaqi-zhao Yes. As suggested in #7471 (comment), we shall add the synchronous API first.

@george-gu-2021

The asynchronous compression API is not in this PR, right?

@yaqi-zhao Yes. As suggested in #7471 (comment), we shall add the synchronous API first.

Hi @pedroerp, @mbasmanova, @marin-ma, may we prioritize the async-mode interface implementation at the same time? Some key partners are waiting for the optimization to do Velox PoC validation. Thanks!

@FelixYBW (Contributor)

Hi @pedroerp, @mbasmanova, @marin-ma, may we prioritize the async-mode interface implementation at the same time? Some key partners are waiting for the optimization to do Velox PoC validation. Thanks!

I don't think so. Let's have a good design and then implement the features step by step, as we discussed. It's a bad idea to merge immature code first and then refactor it. It doesn't block the customer PoC if they want to use Yaqi's draft; currently, the issue blocking the customer PoC is how to generate data that IAA can decompress, not the PRs.

@yaqi-zhao (Contributor)

Hi @pedroerp, @mbasmanova, @marin-ma, may we prioritize the async-mode interface implementation at the same time? Some key partners are waiting for the optimization to do Velox PoC validation. Thanks!

I don't think so. Let's have a good design and then implement the features step by step, as we discussed. It's a bad idea to merge immature code first and then refactor it. It doesn't block the customer PoC if they want to use Yaqi's draft; currently, the issue blocking the customer PoC is how to generate data that IAA can decompress, not the PRs.

@FelixYBW The blocking issue is not data generation; #7437 is merged, and there is no blocker there.

@majetideepak (Collaborator)

@marin-ma Some high-level comments.
Why create a new folder v2? Why not update the existing Compression.h/.cpp files?
Do we need all of the Arrow API? Where will we use APIs such as maximumCompressionLevel?
I feel it will be easier to start with a single codec, say Snappy, that both DWRF and Parquet/Arrow use, and consolidate with them. Doing just the decompress path first will be easier to review.
What do you think?

@marin-ma (Contributor, Author)

@marin-ma Some high-level comments. Why create a new folder v2? Why not update the existing Compression.h/.cpp files? Do we need all of the Arrow API? Where will we use APIs such as maximumCompressionLevel? I feel it will be easier to start with a single codec, say Snappy, that both DWRF and Parquet/Arrow use, and consolidate with them. Doing just the decompress path first will be easier to review. What do you think?

@majetideepak Thank you for the review.

Why create a new folder V2? Why not update the existing compression.h/.cpp files?

This was based on the discussion in #7471 (comment). The new folder "v2" is intended to introduce the new API first. Next, I will replace the compression API used in the parquet and dwio modules with the new API. Meanwhile, common/compression will be replaced by common/compression/v2.

Do we need all the Arrow API? Where will we use API such as maximumCompressionLevel?

This is a user-level API, which can be useful when users want to set different compression levels, such as in a Parquet writer. Given that the minimum and maximum compression levels can vary among different compression codecs, maximumCompressionLevel provides users with a boundary to ensure that a valid compression level is used. However, this approach also makes the API complicated. If the APIs related to "compression level" are unnecessary, I can remove them, which would then require users to refer to the documentation of the compression library for this information.

I feel it will be easier to start with a single codec say Snappy that both DWRF and Parquet/Arrow use and consolidate with them. Just the decompress path first will be easier to review.

This should indeed make the review process more straightforward. However, I'm unsure about the practicality of replacing only one codec with the new API while keeping the original ones for the rest. Does this mean that we should temporarily disable other codecs until they can be integrated individually? Or perhaps you have a more efficient suggestion for this replacement process?

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 6f462b9 to 22f7435 Compare December 1, 2023 06:35
@majetideepak (Collaborator)

This was based on the discussion in #7471 (comment)

Thanks for this pointer. Let's continue with the steps outlined in that issue. I will make a pass today.

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 7232970 to 8fb0bea Compare December 1, 2023 14:50
@FelixYBW (Contributor)

FelixYBW commented Dec 1, 2023

@pedroerp @mbasmanova Can you review the PR?

@majetideepak (Collaborator) left a comment:

Some comments on the main Compression.h/cpp API.
I will look at the Hadoop and Lz4 compression code next.

uint64_t bytesWritten;
bool outputTooSmall;
};
struct FlushResult {
Collaborator:

nit: add a new line above struct FlushResult and struct EndResult.

};

/// Compress some input.
/// If bytes_read is 0 on return, then a larger output buffer should be
Collaborator:

CompressResult.bytesRead

uint8_t* output) = 0;

/// Flush part of the compressed output.
/// If outputTooSmall is true on return, flush() should be called again
Collaborator:

FlushResult.outputTooSmall

virtual FlushResult flush(uint64_t outputLength, uint8_t* output) = 0;

/// End compressing, doing whatever is necessary to end the stream.
/// If outputTooSmall is true on return, end() should be called again
Collaborator:

EndResult.outputTooSmall

/// If outputTooSmall is true on return, end() should be called again
/// with a larger buffer. Otherwise, the Compressor should not be used
/// anymore.
/// end() implies flush().
Collaborator:

Can you clarify what end() implies flush() means?

}
auto actualLength =
doGetUncompressedLength(inputLength, input, uncompressedLength);
if (actualLength) {
Collaborator:

actualLength > 0

auto actualLength =
doGetUncompressedLength(inputLength, input, uncompressedLength);
if (actualLength) {
if (uncompressedLength) {
Collaborator:

uncompressedLength > 0

VELOX_USER_CHECK_EQ(
*actualLength,
*uncompressedLength,
"Invalid uncompressed length: {}.",
Collaborator:

clarify that expected uncompressed length {uncompressedLength} = {actualLength}

/// be written in this call will be written in subsequent calls to this
/// function. This is useful when fixed-size compression blocks are required
/// by the caller.
/// Note: Only Gzip and Zstd codec supports this function.
Collaborator:

Should we have an API supportsPartialCompression?

/// function. This is useful when fixed-size compression blocks are required
/// by the caller.
/// Note: Only Gzip and Zstd codec supports this function.
virtual uint64_t compressPartial(
Collaborator:

The API name compressPartial is misleading since this is still a one-shot compression.
But I can't think of an alternative name either :)

@@ -76,6 +76,10 @@ std::string compressionKindToString(CompressionKind kind) {
return "lz4";
case CompressionKind_GZIP:
return "gzip";
case CompressionKind_LZ4RAW:
Collaborator:

nit: can you please group these two with the lz4 above?

@@ -0,0 +1,29 @@
# Copyright (c) Facebook, Inc. and its affiliates.
Collaborator:

Is the v2 folder supposed to replace its parent in the future? This structure is confusing: for example, you put Lz4Compression in this folder, while the original lzoDecompressor is in the parent directory. Also, how are you going to organize compressionKindToCodec, etc.? If the intention is to replace the current compress/decompress interface, it would be better to just make the changes in the parent (velox/common/compression) folder, so we can see what the structure of the interfaces is.

/// If bytes_read is 0 on return, then a larger output buffer should be
/// supplied.
virtual CompressResult compress(
uint64_t inputLength,
Collaborator:

I think Velox convention is to have the array first, length second. Same for both input and output.

std::numeric_limits<int32_t>::min();

// Streaming compressor interface.
class Compressor {
Collaborator:

I actually think the naming of Compressor and Codec is a bit confusing unless you are familiar with Arrow. In velox::common there is an "encode" folder, which contains integer encoding/decoding. Compression codecs don't belong there, yet they are also named "codec". And people won't guess that "Compressor" is actually for streaming compression. Following Velox naming conventions, I think it's better to name them StreamingCompressor and Compressor. Even in Arrow, a codec is called a "decompressor" or "compressor", e.g. in column_reader:

decompressor_ = GetCodec(codec);

uint64_t compressedSize = 0;
compressed.resize(10);
bool doFlush = false;
// Generate small random input buffer size.
Collaborator:

Add an empty line above all comments in this function to improve readability

uint64_t remaining = compressed.size();
uint64_t decompressedSize = 0;
decompressed.resize(10);
// Generate small random input buffer size.
Collaborator:

Add an empty line above all comments in this function to improve readability


// Check the streaming compressor against one-shot decompression.
void checkStreamingCompressor(Codec* codec, const std::vector<uint8_t>& data) {
// Run streaming compression.
Collaborator:

Add an empty line above all comments in this function to improve readability. Same for the next function

@yingsu00 (Collaborator)

yingsu00 commented Dec 4, 2023

This was based on the discussion in #7471 (comment). The new folder "v2" is intended to introduce the new API first. Next, I will replace the compression API used in the parquet and dwio modules with the new API. Meanwhile, common/compression will be replaced by common/compression/v2.

I actually don't find this approach very clear. For one thing, we don't know which file will end up in which folder. Also, I think Compression.h/.cpp should be merged with the ones in the velox/common/compression folder. I think you can achieve the same goal without disrupting the Velox code base by putting the code where it belongs.

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch 2 times, most recently from be335e4 to 6e1b203 Compare December 7, 2023 04:00
@marin-ma marin-ma changed the title Add unified compression API and lz4_frame/lz4_raw/lz4_hadoop codec feat: Add unified compression API and lz4_frame/lz4_raw/lz4_hadoop codec Nov 22, 2024
@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 25a90c5 to 8253b53 Compare November 25, 2024 01:29
@marin-ma (Contributor, Author)

@pedroerp Could you help review this again? Thanks!

Comment on lines 22 to 27
if(VELOX_ENABLE_COMPRESSION_LZ4)
list(APPEND VELOX_COMMON_COMPRESSION_SRCS Lz4Compression.cpp
HadoopCompressionFormat.cpp)
list(APPEND VELOX_COMMON_COMPRESSION_LINK_LIBS lz4::lz4)
endif()

@assignUser (Collaborator) commented Nov 27, 2024:

Please avoid using variables in this way. Use velox_target_sources and velox_link_libraries instead. Both can be called multiple times on the same target.

CMakeLists.txt Outdated
Comment on lines 279 to 281
if(VELOX_ENABLE_COMPRESSION_LZ4)
add_definitions(-DVELOX_ENABLE_COMPRESSION_LZ4)
endif()
Collaborator:

Move this into common/compression and use velox_compile_definitions(velox_common_compression PUBLIC VELOX_ENABLE_COMPRESSION_LZ4) (-D is stripped and readded anyway)

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 0724c0b to 41dd6e7 Compare November 28, 2024 01:15
@marin-ma (Contributor, Author)

@assignUser Could you help review this again? It seems that the UT failures are not related to this PR. Thanks!

HadoopCompressionFormat.cpp)
velox_link_libraries(velox_common_compression PUBLIC lz4::lz4)
velox_compile_definitions(velox_common_compression
PRIVATE VELOX_ENABLE_COMPRESSION_LZ4)
Collaborator:

Ah it's not used in a header so PRIVATE is better, good catch!

@@ -117,6 +117,7 @@ option(VELOX_ENABLE_PARQUET "Enable Parquet support" OFF)
option(VELOX_ENABLE_ARROW "Enable Arrow support" OFF)
option(VELOX_ENABLE_REMOTE_FUNCTIONS "Enable remote function support" OFF)
option(VELOX_ENABLE_CCACHE "Use ccache if installed." ON)
option(VELOX_ENABLE_COMPRESSION_LZ4 "Enable Lz4 compression support." OFF)
Collaborator:

We probably do not need this option.

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 4f69737 to 6963bb7 Compare December 4, 2024 07:29
@marin-ma (Contributor, Author)

marin-ma commented Dec 5, 2024

@pedroerp Could you help review this again? Meanwhile, I'm working on the next patch to add more compression codecs. Thanks!

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 6963bb7 to 56f9540 Compare December 10, 2024 11:36
@marin-ma (Contributor, Author)

@pedroerp I have addressed the comments regarding the Expect and module separation. Could you please review it again? Thank you!

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 1aa55c1 to 472cfe9 Compare December 11, 2024 15:50
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
10 participants