
feat: Add unified compression API and lz4_frame/lz4_raw/lz4_hadoop codec #7589

Open · wants to merge 16 commits into main

Conversation

marin-ma (Contributor)

Initial implementation of the proposed unified compression API. This patch defines the compression Codec API, inspired by Apache Arrow, and adds missing functions used in Velox. It adds support for the codecs LZ4_FRAME, LZ4_RAW, and LZ4_HADOOP, and includes unit tests.

Discussion: #7471

netlify bot commented Nov 15, 2023

Deploy Preview for meta-velox canceled.

Latest commit: 0e886a4
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/662a23924b9cec0008a467bf

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 15, 2023
@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch 4 times, most recently from 70fd61a to 26dbdea Compare November 16, 2023 01:45
@marin-ma (Contributor, Author)

@mbasmanova Could you help review this? Thanks!

@FelixYBW (Contributor)

Resolved review threads (outdated):
velox/common/compression/v2/Compression.h
velox/common/compression/v2/HadoopCompressionFormat.cpp (3 threads)
velox/common/compression/v2/Lz4Compression.cpp

ret = LZ4F_createCompressionContext(&ctx_, LZ4F_VERSION);
if (LZ4F_isError(ret)) {
lz4Error(ret, "LZ4 init failed: ");
Collaborator:

Is any content missing from the error message?

Contributor (Author):

lz4Error will try to expand the error message for the return code. Switched the param order to make it clear for the reader.

Resolved review threads (outdated):
velox/common/compression/v2/Lz4Compression.cpp (2 threads)
velox/common/compression/v2/tests/CompressionTest.cpp
@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 60d52b2 to 6f462b9 Compare November 17, 2023 01:56
@marin-ma (Contributor, Author)

@mbasmanova Could you help review this? Thanks!

@yaqi-zhao (Contributor)

The asynchronous compression API is not in this PR, right?

@marin-ma (Contributor, Author)

The asynchronous compression API is not in this PR, right?

@yaqi-zhao Yes. As suggested in #7471 (comment), we shall add the synchronous API first.

@george-gu-2021

The asynchronous compression API is not in this PR, right?

@yaqi-zhao Yes. As suggested in #7471 (comment), we shall add the synchronous API first.

Hi @pedroerp, @mbasmanova, @marin-ma, may we prioritize the async-mode interface implementation at the same time? Some key partners are waiting for the optimization to do Velox PoC validation. Thanks!

@FelixYBW (Contributor)

Hi @pedroerp, @mbasmanova, @marin-ma, may we prioritize the async-mode interface implementation at the same time? Some key partners are waiting for the optimization to do Velox PoC validation. Thanks!

I don't think so. Let's have a good design and then implement the features step by step, as we discussed. It's a bad idea to merge immature code first and then refactor it. It doesn't block the customer PoC if they want to use Yaqi's draft; currently, the issue blocking the customer PoC is how to generate data that IAA can decompress, not the PRs.

@yaqi-zhao (Contributor)

Hi @pedroerp, @mbasmanova, @marin-ma, may we prioritize the async-mode interface implementation at the same time? Some key partners are waiting for the optimization to do Velox PoC validation. Thanks!

I don't think so. Let's have a good design and then implement the features step by step, as we discussed. It's a bad idea to merge immature code first and then refactor it. It doesn't block the customer PoC if they want to use Yaqi's draft; currently, the issue blocking the customer PoC is how to generate data that IAA can decompress, not the PRs.

@FelixYBW The blocking issue is not data generation; #7437 is merged, and there is no blocker there.

@majetideepak (Collaborator)

@marin-ma Some high-level comments.
Why create a new folder v2? Why not update the existing Compression.h/.cpp files?
Do we need all of the Arrow API? Where will we use APIs such as maximumCompressionLevel?
I feel it will be easier to start with a single codec, say Snappy, that both DWRF and Parquet/Arrow use, and consolidate with them. Doing just the decompress path first will be easier to review.
What do you think?

@marin-ma (Contributor, Author)

@marin-ma Some high-level comments. Why create a new folder v2? Why not update the existing Compression.h/.cpp files? Do we need all of the Arrow API? Where will we use APIs such as maximumCompressionLevel? I feel it will be easier to start with a single codec, say Snappy, that both DWRF and Parquet/Arrow use, and consolidate with them. Doing just the decompress path first will be easier to review. What do you think?

@majetideepak Thank you for the review.

Why create a new folder V2? Why not update the existing compression.h/.cpp files?

This was based on the discussion in #7471 (comment). The new folder "v2" is intended to introduce the new API first. Next, I will replace the compression API used in the parquet and dwio modules with the new API. Meanwhile, common/compression will be replaced by common/compression/v2.

Do we need all the Arrow API? Where will we use API such as maximumCompressionLevel?

This is a user-level API, which can be useful when users want to set different compression levels, such as in a Parquet writer. Given that the minimum and maximum compression levels can vary among different compression codecs, maximumCompressionLevel provides users with a boundary to ensure that a valid compression level is used. However, this approach also makes the API complicated. If the APIs related to "compression level" are unnecessary, I can remove them, which would then require users to refer to the documentation of the compression library for this information.

I feel it will be easier to start with a single codec say Snappy that both DWRF and Parquet/Arrow use and consolidate with them. Just the decompress path first will be easier to review.

This should indeed make the review process more straightforward. However, I'm unsure about the practicality of replacing only one codec with the new API while keeping the original ones for the rest. Does this mean that we should temporarily disable other codecs until they can be integrated individually? Or perhaps you have a more efficient suggestion for this replacement process?

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 6f462b9 to 22f7435 Compare December 1, 2023 06:35
@majetideepak (Collaborator)

This was based on the discussion in #7471 (comment)

Thanks for this pointer. Let's continue with the steps outlined in that issue. I will make a pass today.

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 7232970 to 8fb0bea Compare December 1, 2023 14:50
@FelixYBW (Contributor)

FelixYBW commented Dec 1, 2023

@pedroerp @mbasmanova Can you review the PR?

@majetideepak (Collaborator) left a comment:

Some comments on the main Compression.h/cpp API.
I will look at the Hadoop and Lz4 compression code next.

uint64_t bytesWritten;
bool outputTooSmall;
};
struct FlushResult {
Collaborator:

nit: add a new line above struct FlushResult and struct EndResult.

};

/// Compress some input.
/// If bytes_read is 0 on return, then a larger output buffer should be
Collaborator:

CompressResult.bytesRead

uint8_t* output) = 0;

/// Flush part of the compressed output.
/// If outputTooSmall is true on return, flush() should be called again
Collaborator:

FlushResult.outputTooSmall

virtual FlushResult flush(uint64_t outputLength, uint8_t* output) = 0;

/// End compressing, doing whatever is necessary to end the stream.
/// If outputTooSmall is true on return, end() should be called again
Collaborator:

EndResult.outputTooSmall

/// If outputTooSmall is true on return, end() should be called again
/// with a larger buffer. Otherwise, the Compressor should not be used
/// anymore.
/// end() implies flush().
Collaborator:

Can you clarify what end() implies flush() means?

}
auto actualLength =
doGetUncompressedLength(inputLength, input, uncompressedLength);
if (actualLength) {
Collaborator:

actualLength > 0

auto actualLength =
doGetUncompressedLength(inputLength, input, uncompressedLength);
if (actualLength) {
if (uncompressedLength) {
Collaborator:

uncompressedLength > 0

VELOX_USER_CHECK_EQ(
*actualLength,
*uncompressedLength,
"Invalid uncompressed length: {}.",
Collaborator:

clarify that expected uncompressed length {uncompressedLength} = {actualLength}

/// be written in this call will be written in subsequent calls to this
/// function. This is useful when fixed-size compression blocks are required
/// by the caller.
/// Note: Only Gzip and Zstd codec supports this function.
Collaborator:

Should we have an API supportsPartialCompression?

/// function. This is useful when fixed-size compression blocks are required
/// by the caller.
/// Note: Only Gzip and Zstd codec supports this function.
virtual uint64_t compressPartial(
Collaborator:

The API name compressPartial is misleading since this is still a one-shot compression.
But I can't think of an alternative name either :)

@@ -76,6 +76,10 @@ std::string compressionKindToString(CompressionKind kind) {
return "lz4";
case CompressionKind_GZIP:
return "gzip";
case CompressionKind_LZ4RAW:
Collaborator:

nit: can you please group these two with the lz4 above?

@@ -0,0 +1,29 @@
# Copyright (c) Facebook, Inc. and its affiliates.
Collaborator:

Is the v2 folder supposed to replace its parent in the future? This structure is confusing: for example, you put Lz4Compression in this folder, while the original lzoDecompressor is in the parent directory. Also, how are you going to organize compressionKindToCodec, etc.? If the intention is to replace the current compress/decompress interface, it would be better to just make the changes in the parent (velox/common/compression) folder, so we can see what the structure of the interfaces is.

/// If bytes_read is 0 on return, then a larger output buffer should be
/// supplied.
virtual CompressResult compress(
uint64_t inputLength,
Collaborator:

I think Velox convention is to have the array first, length second. Same for both input and output.

std::numeric_limits<int32_t>::min();

// Streaming compressor interface.
class Compressor {
Collaborator:

I actually think the naming of Compressor and Codec is a bit confusing unless you are familiar with Arrow. In velox::common there is an "encode" folder, which contains integer encoding/decoding. Compression codecs don't belong there, yet they are also named "codec". And people won't guess that "Compressor" is actually for streaming compression. Following Velox naming conventions, I think it's better to name them StreamingCompressor and Compressor. Even in Arrow, a codec is called a "decompressor" or "compressor", e.g. in column_reader:

decompressor_ = GetCodec(codec);

uint64_t compressedSize = 0;
compressed.resize(10);
bool doFlush = false;
// Generate small random input buffer size.
Collaborator:

Add an empty line above all comments in this function to improve readability

uint64_t remaining = compressed.size();
uint64_t decompressedSize = 0;
decompressed.resize(10);
// Generate small random input buffer size.
Collaborator:

Add an empty line above all comments in this function to improve readability


// Check the streaming compressor against one-shot decompression.
void checkStreamingCompressor(Codec* codec, const std::vector<uint8_t>& data) {
// Run streaming compression.
Collaborator:

Add an empty line above all comments in this function to improve readability. Same for the next function

@yingsu00 (Collaborator)

yingsu00 commented Dec 4, 2023

This was based on the discussion in #7471 (comment). The new folder "v2" is intended to introduce the new API first. Next, I will replace the compression API used in the parquet and dwio modules with the new API. Meanwhile, common/compression will be replaced by common/compression/v2.

I actually don't find this approach very clear. For one thing, we don't know which file will end up in which folder. Also, I think Compression.h/.cpp should be merged with the ones in the velox/common/compression folder. I think you can achieve the same goal without disrupting the Velox code base by putting the code where it belongs.

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch 2 times, most recently from be335e4 to 6e1b203 Compare December 7, 2023 04:00
@marin-ma marin-ma changed the title Add unified compression API and lz4_frame/lz4_raw/lz4_hadoop codec feat: Add unified compression API and lz4_frame/lz4_raw/lz4_hadoop codec Nov 22, 2024
@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 25a90c5 to 8253b53 Compare November 25, 2024 01:29
@marin-ma (Contributor, Author)

@pedroerp Could you help review this again? Thanks!

Comment on lines 22 to 27
if(VELOX_ENABLE_COMPRESSION_LZ4)
list(APPEND VELOX_COMMON_COMPRESSION_SRCS Lz4Compression.cpp
HadoopCompressionFormat.cpp)
list(APPEND VELOX_COMMON_COMPRESSION_LINK_LIBS lz4::lz4)
endif()

@assignUser (Collaborator) commented Nov 27, 2024:

Please avoid using variables in this way. Use velox_target_sources and velox_link_libraries instead. Both can be called multiple times on the same target.

CMakeLists.txt Outdated
Comment on lines 279 to 281
if(VELOX_ENABLE_COMPRESSION_LZ4)
add_definitions(-DVELOX_ENABLE_COMPRESSION_LZ4)
endif()
Collaborator:

Move this into common/compression and use velox_compile_definitions(velox_common_compression PUBLIC VELOX_ENABLE_COMPRESSION_LZ4) (-D is stripped and readded anyway)

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 0724c0b to 41dd6e7 Compare November 28, 2024 01:15
@marin-ma (Contributor, Author)

@assignUser Could you help review this again? It seems that the UT failures are not related to this PR. Thanks!

HadoopCompressionFormat.cpp)
velox_link_libraries(velox_common_compression PUBLIC lz4::lz4)
velox_compile_definitions(velox_common_compression
PRIVATE VELOX_ENABLE_COMPRESSION_LZ4)
Collaborator:

Ah it's not used in a header so PRIVATE is better, good catch!

@@ -117,6 +117,7 @@ option(VELOX_ENABLE_PARQUET "Enable Parquet support" OFF)
option(VELOX_ENABLE_ARROW "Enable Arrow support" OFF)
option(VELOX_ENABLE_REMOTE_FUNCTIONS "Enable remote function support" OFF)
option(VELOX_ENABLE_CCACHE "Use ccache if installed." ON)
option(VELOX_ENABLE_COMPRESSION_LZ4 "Enable Lz4 compression support." OFF)
Collaborator:

We probably do not need this option.

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 4f69737 to 6963bb7 Compare December 4, 2024 07:29
@marin-ma (Contributor, Author)

marin-ma commented Dec 5, 2024

@pedroerp Could you help review this again? Meanwhile, I'm working on the next patch to add more compression codecs. Thanks!

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 6963bb7 to 56f9540 Compare December 10, 2024 11:36
@marin-ma (Contributor, Author)

@pedroerp I have addressed the comments regarding the Expect and module separation. Could you please review it again? Thank you!

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch from 1aa55c1 to 472cfe9 Compare December 11, 2024 15:50
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
10 participants