Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add python tests for Parquet DELTA_BINARY_PACKED encoder #14316

Merged
merged 23 commits into from
Nov 8, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Oct 23, 2023

Description

During the review of #14100 there was a suggestion to add a test of writing using cudf and then reading the resulting file back with pyarrow. This PR adds the necessary python bindings to perform this test.

NOTE: there is currently an issue with encoding 32-bit values where the deltas exceed 32-bits. parquet-mr and arrow truncate the deltas for the INT32 physical type and allow values to overflow, whereas cudf currently uses 64-bit deltas, which avoids the overflow, but can result in requiring 33-bits when encoding. The current cudf behavior is allowed by the specification (and in fact is readable by parquet-mr), but using the extra bit is not in the Parquet spirit of least output file size. This will be addressed in follow-on work.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested review from a team as code owners October 23, 2023 18:57
@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 23, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Oct 23, 2023
@vuule vuule added feature request New feature or request non-breaking Non-breaking change labels Oct 23, 2023
@vuule
Copy link
Contributor

vuule commented Oct 26, 2023

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a rounding concern

cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
Co-authored-by: Vukasin Milovanovic <[email protected]>
@vuule
Copy link
Contributor

vuule commented Oct 26, 2023

/ok to test

@GregoryKimball
Copy link
Contributor

@galipremsagar Would you please take a look at this testing improvement?

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a cuIO dev, but this looks good to me.

A minor nitpick regarding the naming of a function, but that's unrelated to this change.

@@ -191,6 +199,8 @@ cdef extern from "cudf/io/parquet.hpp" namespace "cudf::io" nogil:
void set_row_group_size_rows(size_type val) except +
void set_max_page_size_bytes(size_t val) except +
void set_max_page_size_rows(size_type val) except +
void enable_write_v2_headers(bool val) except +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize it isn't the fault of this current PR, but one does wish enable_write_v2_headers were named set_write_v2_headers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use enable_ for bool options, so this should be consistent (for better or for worse, apparently).

@@ -6370,6 +6370,8 @@ def to_parquet(
max_page_size_rows=None,
storage_options=None,
return_metadata=False,
use_dictionary=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document these two parameters here:

_docstring_to_parquet = """

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @galipremsagar. I would have never thought to look there for the docstring 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First time hearing about it as well 🤷‍♂️ (just don't git blame ioutils.py :P )

python/cudf/cudf/tests/test_parquet.py Show resolved Hide resolved
python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved
@etseidl etseidl requested a review from galipremsagar November 7, 2023 17:48
@vuule
Copy link
Contributor

vuule commented Nov 7, 2023

/ok to test

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Nov 7, 2023
@vuule
Copy link
Contributor

vuule commented Nov 8, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Nov 8, 2023

/merge

@rapids-bot rapids-bot bot merged commit c4e6c09 into rapidsai:branch-23.12 Nov 8, 2023
61 checks passed
@etseidl etseidl deleted the delta_encode_python branch November 8, 2023 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants